Neural Headline Generation on Abstract Meaning Representation

Neural network-based encoder-decoder models are among the most attractive recent methodologies for tackling natural language generation tasks. This paper investigates the usefulness of structural syntactic and semantic information additionally incorporated into a baseline neural attention-based model. We encode the output of an abstract meaning representation (AMR) parser using a modified version of Tree-LSTM. Our proposed attention-based AMR encoder-decoder model improves headline generation benchmarks compared with the baseline neural attention-based model.


Introduction
Neural network-based encoder-decoder models are cutting-edge methodologies for tackling natural language generation (NLG) tasks, e.g., machine translation (Cho et al., 2014), image captioning (Vinyals et al., 2015), video description (Venugopalan et al., 2015), and headline generation (Rush et al., 2015). This paper shares a similar goal and motivation with previous work: improving encoder-decoder models for natural language generation. There are several directions for enhancement. This paper respects the fact that NLP researchers have expended an enormous amount of effort to develop fundamental NLP techniques such as POS tagging, dependency parsing, named entity recognition, and semantic role labeling. Intuitively, the structural, syntactic, and semantic information underlying input text has the potential to improve the quality of NLG tasks. However, to the best of our knowledge, there is no clear evidence that syntactic and semantic information can enhance the recently developed encoder-decoder models in NLG tasks.
To answer this research question, this paper proposes and evaluates a headline generation method based on an encoder-decoder architecture over Abstract Meaning Representation (AMR). The method is essentially an extension of attention-based summarization (ABS) (Rush et al., 2015). Our proposed method encodes the output of an AMR parser with a modified version of the Tree-LSTM encoder (Tai et al., 2015) and feeds it to the baseline ABS model as additional information. Conceptually, the reason for using AMR for headline generation is that the information represented in AMR, such as predicate-argument structures and named entities, can provide effective clues when producing shorter summaries (headlines) from longer original sentences. We expect the quality of headlines to improve from this combination of ABS and AMR.

Attention-based summarization (ABS)
ABS, proposed in Rush et al. (2015), has achieved state-of-the-art performance on benchmark data for headline generation, including the DUC-2004 dataset (Over et al., 2007). Figure 1 illustrates the model structure of ABS. The model predicts a word sequence (summary) based on the combination of a neural network language model and an input sentence encoder.
Let V be a vocabulary, and let x_i ∈ {0,1}^{|V|} be the indicator (one-hot) vector corresponding to the i-th word in the input sentence. Suppose the input sentence has M words; then X = (x_1, ..., x_M) represents the input sentence as a sequence of indicator vectors of length M. Similarly, let Y = (y_1, ..., y_L) represent a sequence of indicator vectors of length L. Here, we assume L < M. Y_{C,i} is a short notation for the list of vectors consisting of the sub-sequence of Y from y_{i−C+1} to y_i. We assume a one-hot vector for a special start symbol, such as "⟨S⟩", when i < 1. Then, ABS outputs a summary Ŷ given an input sentence X as follows:

  Ŷ = argmax_Y log p(Y | X),  (1)
  log p(Y | X) ≈ Σ_{i=0}^{L−1} log p(y_{i+1} | X, Y_{C,i}),  (2)
  p(y_{i+1} | X, Y_{C,i}) ∝ exp(nnlm(Y_{C,i}) + enc(X, Y_{C,i})),  (3)

where nnlm(Y_{C,i}) is the feed-forward neural network language model proposed in Bengio et al. (2003), and enc(X, Y_{C,i}) is an input sentence encoder with an attention mechanism. This paper uses D and H to denote the sizes (dimensions) of the word-embedding and hidden-layer vectors, respectively. Let E ∈ R^{D×|V|} be an embedding matrix of output words, and let U ∈ R^{H×(CD)} and O ∈ R^{|V|×H} be weight matrices of the hidden and output layers, respectively. Using the above notation, nnlm(Y_{C,i}) in Equation 3 can be written as follows:

  nnlm(Y_{C,i}) = O tanh(U ỹ_c),  (4)

where ỹ_c is the concatenation of the output embedding vectors from i−C+1 to i, that is, ỹ_c = (E y_{i−C+1}; ...; E y_i). Therefore, ỹ_c is a (CD)-dimensional vector.
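As a concrete illustration, the nnlm term of Equation 3 can be sketched in a few lines of numpy. This is a minimal sketch under the notation above, not the authors' implementation; the function and variable names are ours.

```python
import numpy as np

def nnlm(Y_context, E, U, O):
    """Sketch of the feed-forward NNLM term of Equation 3 (names are ours).

    Y_context : list of C one-hot vectors (the previous C output words)
    E : D x |V| output-word embedding matrix
    U : H x (C*D) hidden-layer weight matrix
    O : |V| x H output-layer weight matrix
    Returns an unnormalized |V|-dimensional score vector over the next word.
    """
    # y_tilde: concatenation of the C output-word embeddings, (C*D)-dimensional
    y_tilde = np.concatenate([E @ y for y in Y_context])
    h = np.tanh(U @ y_tilde)  # hidden layer
    return O @ h              # one score per vocabulary word
```

In the full model these scores are summed with the encoder scores and exponentiated to obtain the next-word distribution of Equation 3.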
Next, F ∈ R^{D×|V|} and E′ ∈ R^{D×|V|} denote embedding matrices of input and output words, respectively. O′ ∈ R^{|V|×D} is a weight matrix for the output layer, and P ∈ R^{D×(CD)} is a weight matrix that maps the embedding of the C output words onto the space of input-word embeddings. X̃ = [F x_1, ..., F x_M] is the matrix form of the list of input embeddings. The encoder is then defined as:

  enc(X, Y_{C,i}) = O′ (X̄ p),  (5)
  p = softmax(X̃^T P ỹ′_c),  (6)

where ỹ′_c is the concatenation of the output embedding vectors from i−C+1 to i, analogous to ỹ_c but built with E′, and X̄ is the matrix whose i-th column is the local average x̄_i = (1/Q) Σ_{q=i−Q}^{i+Q} x̃_q of the columns of X̃. Equation 6 is generally referred to as the attention model, which is introduced to encode a relationship between input words and the previous C output words. For example, if the previous C output words are assumed to align to x_i, then the surrounding Q words (x_{i−Q}, ..., x_{i+Q}) are highly weighted by Equation 5.
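The attention-based encoder admits a similar sketch. The following is our own minimal numpy rendering of the encoder described above (all names are ours, and the ±Q smoothing is written as a clipped window mean), not the released code.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def attention_encoder(X_onehots, Y_context, F, E_out, P, O_out, Q=2):
    """Sketch of the attention-based input encoder enc(X, Y_{C,i}).

    X_onehots : list of M one-hot input-word vectors
    Y_context : list of C one-hot output-word vectors
    F, E_out  : D x |V| input / output embedding matrices
    P         : D x (C*D) matrix mapping the context onto input-embedding space
    O_out     : |V| x D output-layer weight matrix
    """
    X_tilde = np.stack([F @ x for x in X_onehots], axis=1)    # D x M
    y_tilde = np.concatenate([E_out @ y for y in Y_context])  # (C*D,)
    # Attention weights over the M input positions
    p = softmax(X_tilde.T @ (P @ y_tilde))                    # (M,)
    # Local smoothing over a +/-Q window (clipped at sentence boundaries)
    M = X_tilde.shape[1]
    X_bar = np.stack([X_tilde[:, max(0, i - Q):i + Q + 1].mean(axis=1)
                      for i in range(M)], axis=1)             # D x M
    return O_out @ (X_bar @ p)                                # (|V|,)
```

The attention vector p concentrates on input positions whose embeddings match the current output context, and the smoothed matrix X_bar spreads that weight to the surrounding Q words.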

Proposed Method
Our assumption here is that syntactic and semantic features of an input sentence can greatly help in generating a headline. For example, the meanings of subjects, predicates, and objects in a generated headline should correspond to those appearing in the input sentence. Thus, we incorporate syntactic and semantic features into the framework of headline generation. This paper uses AMR as a case study of such additional features.

AMR
An AMR is a rooted, directed, acyclic graph (DAG) that encodes the meaning of a sentence. Nodes in an AMR graph represent 'concepts', and directed edges represent relationships between nodes. Concepts consist of English words, PropBank event predicates, and special labels such as "person". For edges, AMR has approximately 100 relations (Banarescu et al., 2013), including semantic roles based on the PropBank annotations in OntoNotes (Hovy et al., 2006). To acquire AMRs for input sentences, we use the state-of-the-art transition-based AMR parser (Wang et al., 2015).

Figure 2 shows a brief sketch of the model structure of our attention-based AMR encoder. We utilize a variant of the child-sum Tree-LSTM originally proposed in Tai et al. (2015) to encode the syntactic and semantic information in the AMR parser output into fixed-length embedding vectors. To simplify the computation, we transform the DAG structure of the AMR parser output into a tree structure, which we refer to as the "tree-converted AMR structure". This transformation is performed by splitting each node that has multiple incoming edges, which typically arise from coreferential concepts, into a corresponding number of copies, one per incoming edge. We then straightforwardly modify the Tree-LSTM to also encode edge labels, since AMR provides both node and edge labels, whereas the original Tree-LSTM encodes only node labels.
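The DAG-to-tree conversion can be sketched as follows, under our simplifying assumption that only the re-entrant node itself (not its subtree) is duplicated; all identifiers and the data representation are hypothetical, not the authors' code.

```python
import itertools

def dag_to_tree(nodes, edges):
    """Split nodes with multiple incoming edges so every node has one parent.

    nodes : dict node_id -> concept label
    edges : list of (parent_id, edge_label, child_id) triples
    Returns a new (nodes, edges) pair forming a tree/forest.
    """
    counter = itertools.count()
    new_nodes, new_edges = dict(nodes), []
    seen_parent = {}  # child_id -> its first parent
    for parent, label, child in edges:
        if child in seen_parent:
            # Re-entrant node (e.g. a coreferential concept): duplicate it
            dup = f"{child}_copy{next(counter)}"
            new_nodes[dup] = nodes[child]
            new_edges.append((parent, label, dup))
        else:
            seen_parent[child] = parent
            new_edges.append((parent, label, child))
    return new_nodes, new_edges
```

A fuller conversion might copy the whole subtree below a re-entrant node; the sketch keeps only the node label, which is the information the Tree-LSTM encoder consumes.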

Attention-Based AMR Encoder
Let n_j and e_j be N- and E-dimensional embeddings for the label assigned to the j-th node and for its out-edge directed to its parent node, respectively (we prepare a special edge embedding for the root node). W_in, W_fn, W_on, W_un ∈ R^{D×N} are weight matrices for node embeddings n_j. Similarly, W_ie, W_fe, W_oe, W_ue ∈ R^{D×E} are weight matrices for edge embeddings e_j, and W_ih, W_fh, W_oh, W_uh ∈ R^{D×D} are weight matrices for the output vectors connected from child nodes. B(j) represents the set of nodes that have a direct edge to the j-th node in our tree-converted AMR structure. We then define the embedding a_j obtained at node j in the tree-converted AMR structure via the Tree-LSTM as follows:

  ā_j = Σ_{k∈B(j)} a_k,  (7)
  i_j = σ(W_in n_j + W_ie e_j + W_ih ā_j),  (8)
  f_{jk} = σ(W_fn n_j + W_fe e_j + W_fh a_k),  (9)
  o_j = σ(W_on n_j + W_oe e_j + W_oh ā_j),  (10)
  u_j = tanh(W_un n_j + W_ue e_j + W_uh ā_j),  (11)
  c_j = i_j ⊙ u_j + Σ_{k∈B(j)} f_{jk} ⊙ c_k,  (12)
  a_j = o_j ⊙ tanh(c_j),  (13)

where σ is the logistic sigmoid function and ⊙ denotes element-wise multiplication. Let J be the number of nodes in the tree-converted AMR structure obtained from a given input sentence. We introduce A ∈ R^{D×J} as the matrix form of the list of hidden states a_j for all j, namely A = [a_1, ..., a_J]. Let O′′ ∈ R^{|V|×D} be a weight matrix for the output layer, and let S ∈ R^{D×(CD)} be a weight matrix that maps the context embedding of the C output words onto the embeddings obtained from nodes. We then define the attention-based AMR encoder encAMR(A, Y_{C,i}) as:

  encAMR(A, Y_{C,i}) = O′′ (A s),  s = softmax(A^T S ỹ′_c).  (14)

Finally, we add our attention-based AMR encoder of Equation 14 as an additional term in Equation 3 to build our headline generation system.
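A single node update of the modified child-sum Tree-LSTM can be sketched as follows. This is our own reading of the weight matrices defined above (bias terms omitted; dictionary keys and function names are ours), not the authors' code.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def tree_lstm_node(n_j, e_j, children, W):
    """One node update of a child-sum Tree-LSTM extended with edge labels.

    n_j, e_j : node-label (N-dim) and edge-label (E-dim) embeddings
    children : list of (a_k, c_k) hidden/cell state pairs of child nodes
    W        : dict of weight matrices, e.g. W['in'] is D x N for the
               input gate / node embedding, W['ih'] is D x D, etc.
    Returns the (a_j, c_j) hidden and cell states of node j.
    """
    D = W['ih'].shape[1]
    a_sum = sum((a for a, _ in children), np.zeros(D))  # child-sum of hidden states
    i = sigmoid(W['in'] @ n_j + W['ie'] @ e_j + W['ih'] @ a_sum)   # input gate
    o = sigmoid(W['on'] @ n_j + W['oe'] @ e_j + W['oh'] @ a_sum)   # output gate
    u = np.tanh(W['un'] @ n_j + W['ue'] @ e_j + W['uh'] @ a_sum)   # candidate
    c = i * u
    for a_k, c_k in children:
        # One forget gate per child, conditioned on that child's hidden state
        f_k = sigmoid(W['fn'] @ n_j + W['fe'] @ e_j + W['fh'] @ a_k)
        c = c + f_k * c_k
    a = o * np.tanh(c)
    return a, c
```

Applying this update bottom-up over the tree-converted AMR structure yields the hidden states a_1, ..., a_J that form the matrix A consumed by the attention-based AMR encoder.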

Experiments
To demonstrate the effectiveness of our proposed method, we conducted experiments on benchmark data of the abstractive headline generation task described in Rush et al. (2015).

[Table 1: Recall of ROUGE-1, ROUGE-2, and ROUGE-L for ABS (Rush et al., 2015), ABS (re-run), ABS+AMR (w/o attn), and ABS+AMR on DUC-2004, the Gigaword test data used in Rush et al. (2015), and our sampled Gigaword test data.]

For a fair comparison, we followed the evaluation setting of Rush et al. (2015). The training data was obtained from the first sentence and the headline of each document in the annotated Gigaword corpus (Napoles et al., 2012). The development data is the DUC-2003 data, and the test data are both DUC-2004 (Over et al., 2007) and sentence-headline pairs obtained from the annotated Gigaword corpus in the same way as the training data. All generated headlines were evaluated by ROUGE (Lin, 2004). For evaluation on DUC-2004, we removed strings after 75 characters from each generated headline, as specified in the DUC-2004 evaluation. For evaluation on Gigaword, we limited system outputs to at most 8 words, as in Rush et al. (2015), since the average headline length in Gigaword is 8.3 words. As preprocessing for all data, all letters were converted to lower case, all digits were replaced with '#', and words appearing fewer than five times were replaced with 'UNK'. Note that, for further evaluation, we prepared 2,000 sentence-headline pairs randomly sampled from the test data section of the Gigaword corpus as additional test data.
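The preprocessing steps described above can be sketched as follows (a minimal illustration; the function name and signature are ours).

```python
import re
from collections import Counter

def preprocess(sentences, min_count=5):
    """Lowercase, map every digit to '#', and replace words occurring
    fewer than `min_count` times in the corpus with 'UNK'."""
    tokenized = [re.sub(r"\d", "#", s.lower()).split() for s in sentences]
    counts = Counter(w for toks in tokenized for w in toks)
    return [[w if counts[w] >= min_count else "UNK" for w in toks]
            for toks in tokenized]
```

Note that the vocabulary counts are computed over the already lowercased, digit-normalized tokens, so e.g. "1997" and "2004" collapse to the same '####' token before the frequency cutoff is applied.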
In our experiments, we refer to the baseline neural attention-based abstractive summarization method of Rush et al. (2015) as "ABS", and to our proposed method described in Section 3, which incorporates AMR structural information through a neural encoder, as "ABS+AMR". Additionally, we evaluated the AMR encoder without the attention mechanism, referred to as "ABS+AMR (w/o attn)", to investigate the contribution of the attention mechanism to the AMR encoder. For parameter estimation (training), we used stochastic gradient descent. We tried several values for the initial learning rate and selected the value that achieved the best performance for each method. We halved the learning rate if the log-likelihood on the validation set did not improve over an epoch. The selected hyper-parameters were D = 200, H = 400, N = 200, E = 50, C = 5, and Q = 2. We re-normalized the embeddings after each epoch (Hinton et al., 2012).
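The learning-rate schedule can be sketched as follows (our reading of the description above; the function name is ours).

```python
def update_learning_rate(lr, val_loglik_history):
    """Halve the learning rate whenever the validation log-likelihood
    fails to improve over the previous epoch; otherwise keep it."""
    if len(val_loglik_history) >= 2 and val_loglik_history[-1] <= val_loglik_history[-2]:
        return lr / 2.0
    return lr
```

Called once at the end of each epoch, this keeps the step size large while the model is still improving and decays it geometrically once validation performance plateaus.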
For ABS+AMR, we used a two-step training scheme to accelerate training. The first phase learns the parameters of ABS. The second phase trains the parameters of the AMR encoder on 1 million training pairs while the parameters of the baseline ABS are kept fixed to prevent overfitting.

Table 1 shows the recall of ROUGE (Lin, 2004) on each dataset. ABS (re-run) denotes the performance of ABS re-trained with the distributed scripts (https://github.com/facebook/NAMAS). The proposed method, ABS+AMR, outperforms the baseline ABS on all datasets. In particular, ABS+AMR achieved a statistically significant gain over ABS (re-run) for ROUGE-1 and ROUGE-2 on DUC-2004. In contrast, the improvements on Gigaword (the same test data as in Rush et al. (2015)) were limited compared with those on DUC-2004. We assume this limited gain is largely caused by the quality of the AMR parsing results: the Gigaword test data provided by Rush et al. (2015) is already pre-processed, with, for example, many low-occurrence words already replaced with "UNK", so the AMR parsing results seem relatively worse on this data. To provide evidence for this assumption, we also evaluated performance on our 2,000 randomly sampled sentence-headline pairs taken from the test data section of the annotated Gigaword corpus. "Gigaword (randomly sampled)" in Table 1 shows the results of this setting. We found a statistically significant difference between ABS (re-run) and ABS+AMR on ROUGE-1 and ROUGE-2.

[Figure 3: Examples of generated headlines (I: input, G: gold headline, A: ABS, P: ABS+AMR).
I(1): crown prince abdallah ibn abdel aziz left saturday at the head of saudi arabia 's delegation to the islamic summit in islamabad , the official news agency spa reported .
G: saudi crown prince leaves for islamic summit
A: crown prince leaves for islamic summit in saudi arabia
P: saudi crown prince leaves for islamic summit in riyadh
I(2): a massive gothic revival building once christened the lunatic asylum west of the <unk> was auctioned off for $ #.# million -lrb- euro #.# million -rrb- .
G: massive ##th century us mental hospital fetches $ #.# million at auction
A: west african art sells for $ #.# million in
P: west african art auctioned off for $ #.# million
I(3): brooklyn , the new bastion of cool for many new yorkers , is poised to go mainstream chic .
G: high-end retailers are scouting sites in brooklyn
A: new yorkers are poised to go mainstream with chic
P: new york city is poised to go mainstream chic]
We can also observe that ABS+AMR achieved the best ROUGE-1 scores on all test sets, which suggests that ABS+AMR tends to successfully yield semantically important words. In other words, the embeddings encoded through the AMR encoder are useful for capturing important concepts in input sentences. Figure 3 supports this observation: for example, ABS+AMR successfully added the correct modifier 'saudi' to "crown prince" in the first example, and generated a consistent subject in the third example.
The comparison between ABS+AMR (w/o attn) and ABS+AMR (with attention) suggests that the attention mechanism is necessary for AMR encoding; the encoder without the attention mechanism tends to overfit.

Related Work
Recently, the Recurrent Neural Network (RNN) and its variants have been applied successfully to various NLP tasks. For headline generation tasks, Chopra et al. (2016) exploited an RNN decoder (and its variants) with the attention mechanism instead of the approach of Rush et al. (2015), namely the combination of a feed-forward neural network language model and an attention-based sentence encoder. Subsequent work also adapted the RNN encoder-decoder with attention for headline generation tasks, adding refinements such as hierarchical attention to improve performance, and, beyond using RNN variants, proposed methods for handling infrequent words in natural language generation. Note that these recent developments do not conflict with our method: the AMR encoder can be incorporated into their methods just as straightforwardly as we have incorporated it into our baseline here. We believe that our AMR encoder can further improve the performance of their methods, and we will test this hypothesis in future work.

Conclusion
This paper discussed the usefulness of incorporating structural syntactic and semantic information into attention-based encoder-decoder models for headline generation tasks. We selected abstract meaning representation (AMR) as the syntactic and semantic information, and proposed an attention-based AMR encoder-decoder model. Experimental results on headline generation benchmark data showed that our attention-based AMR encoder-decoder model successfully improved the standard automatic evaluation measures of headline generation tasks: ROUGE-1, ROUGE-2, and ROUGE-L. We believe these results provide empirical evidence that syntactic and semantic information obtained from an automatic parser can help improve the neural encoder-decoder approach in NLG tasks.