Knowledge Graph-Augmented Abstractive Summarization with Semantic-Driven Cloze Reward

Sequence-to-sequence models for abstractive summarization have been studied extensively, yet the generated summaries commonly suffer from fabricated content, and are often found to be near-extractive. We argue that, to address these issues, the summarizer should acquire semantic interpretation over input, e.g., via structured representation, to allow the generation of more informative summaries. In this paper, we present ASGARD, a novel framework for Abstractive Summarization with Graph-Augmentation and semantic-driven RewarD. We propose the use of dual encoders—a sequential document encoder and a graph-structured encoder—to maintain the global context and local characteristics of entities, complementing each other. We further design a reward based on a multiple choice cloze test to drive the model to better capture entity interactions. Results show that our models produce significantly higher ROUGE scores than a variant without the knowledge graph as input on both the New York Times and CNN/Daily Mail datasets. We also obtain better or comparable performance compared to systems that are fine-tuned from large pretrained language models. Human judges further rate our model outputs as more informative and containing fewer unfaithful errors.


Introduction
Abstractive summarization aims to produce concise and informative summaries with the goal of promoting efficient information consumption and knowledge acquisition (Luhn, 1958). Significant progress has been made in this area by designing sequence-to-sequence-based neural models for single-document abstractive summarization (Gehrmann et al., 2018; Liu et al., 2018; Liu and Lapata, 2019). However, due to the limitations of model structure and word prediction-based learning objectives, these models frequently produce unfaithful content (Cao et al., 2018) and near-extractive summaries (See et al., 2017; Kryściński et al., 2018). These observations suggest that existing models lack semantic interpretation over the input, which is critical for summarization. We argue that the generation of informative and succinct abstracts requires structured representation to facilitate the connection of relevant subjects and the preservation of global context, e.g., entity interactions and topic flows. Take Fig. 1 as an example. Complex events related with the same entity may span multiple sentences, making them challenging for existing sequential models to capture. A graph representation, on the contrary, produces a structured summary and highlights the proximity of relevant concepts.

Input article (New York Times): John M. Fabrizi, the mayor of Bridgeport, admitted on Tuesday that he had used cocaine and abused alcohol while in office. Mr. Fabrizi, who was appointed mayor in 2003 after the former mayor, Joseph P. Ganim, went to prison on corruption charges, said he had sought help for his drug problem about 18 months ago and that he had not used drugs since. About four months ago, he added, he stopped drinking alcohol.

Summary by human: The Week column. Mayor John Fabrizi of Bridgeport, Conn., publicly admits he used cocaine and abused alcohol while in office; says he stopped drinking alcohol and sought help for his drug problem about 18 months ago.
To this end, we present ASGARD, a framework for Abstractive Summarization with Graph-Augmentation and semantic-driven RewarD. Under the encoder-decoder framework, we enhance the regular document encoder with a separate graph-structured encoder to maintain the global context and local characteristics of entities, using the outputs from an open information extraction (OpenIE) system.
Specifically, we experiment with two graph variants: one mainly capturing entities' document-level interactions, and the other reflecting such interactions within each paragraph plus topic shifts across paragraphs. Both graphs can capture interactions among entities that are positioned far apart in the document and significantly reduce redundancy, as shown in Fig. 1. The document encoder and the graph encoder then cooperate during abstract generation, wherein the model is trained to identify salient content by aligning graphs with human summaries. Though structured representation has been studied before for summarization (Fernandes et al., 2019), to the best of our knowledge, we are the first to utilize graph neural networks to explicitly encode entity-centered information for abstractive summary generation.
Moreover, we propose a novel multiple choice cloze reward to drive the model to acquire semantic understanding of the input. Concretely, we design cloze questions by removing pairwise entities that are connected with a predicate or co-occur in a human summary sentence, whereas prior work only considers single entities to construct questions (Eyal et al., 2019). In tandem with our graph encoding of knowledge, the cloze reward further facilitates the acquisition of global entity interactions through reinforcement learning.
We carry out automatic and human evaluations on popular summarization datasets. Models based on ASGARD yield significantly better ROUGE scores (Lin and Hovy, 2003) than a variant without access to the knowledge graph on two popular news summarization datasets, the New York Times corpus and the CNN/Daily Mail dataset. Moreover, ASGARD models attain performance better than or comparable to others that are fine-tuned from large pretrained language models, including BERTSum (Liu and Lapata, 2019), UniLM (Dong et al., 2019), and BART (Lewis et al., 2019). Human judges further confirm that our models generate more informative summaries with fewer unfaithful errors than their counterparts without the graph encoder. Importantly, we find that automatic evaluation metrics only weakly correlate with these errors, implying that new evaluation methods are needed to better gauge summary quality.
The rest of the paper is organized as follows. We describe related work in the next section (§2). We then discuss knowledge graph construction in §3 and formulate our graph-augmented summarization framework in §4. In §5, we introduce reinforcement learning with the cloze reward. Experiments and results are presented in §6 and §7. Finally, we conclude in §8.

Related Work
Graph-Augmented Summarization and Generation. Graph structures have long been used for extractive summarization, such as in TextRank (Mihalcea and Tarau, 2004) and LexRank (Erkan and Radev, 2004). Neural models have since employed graph neural networks, such as Graph Attention Networks (Veličković et al., 2018), to capture the global context in a more effective manner. Also related is the graph-to-sequence framework that has been adopted for text generation (Song et al., 2018), where both Gated Graph Neural Networks (GGNNs) (Beck et al., 2018) and Graph Convolutional Networks (GCNs) (Damonte and Cohen, 2019) have been used to encode input graphs.

Reinforcement Learning for Summarization. ROUGE-based rewards are widely used to train abstractive summarizers with reinforcement learning (Paulus et al., 2018; Chen and Bansal, 2018; Sharma et al., 2019). However, ROUGE does not always distinguish good summaries from bad ones (Novikova et al., 2017), and it ignores entity interactions.
Since question answering (QA) has been used for summary evaluation (Narayan et al., 2018) and is shown to correlate with human judgment of summary quality (Eyal et al., 2019), QA-based rewards have been studied for summarization model training. Arumae and Liu (2019) demonstrate that using fill-in-the-blank questions created by removing entities or root words leads to improved content selection. Scialom et al. (2019) consider a similar setup, but use both F1 score and QA system confidence as rewards in abstractive summarization. Previous work, however, mainly focuses on single entities or words in human-written summaries, thereby losing context and relations. Moreover, fill-in-the-blank questions in prior work give credit only when the answers exactly match the ground truths, penalizing rephrased answers and discouraging abstractive content generation. In contrast, we design a semantic-driven cloze reward that measures how well a QA system can address multiple choice cloze questions, which better encode entity interactions and handle paraphrased answers.

Knowledge Graph Construction
To construct a knowledge graph from an input document, we utilize Stanford CoreNLP (Manning et al., 2014) to obtain outputs from coreference resolution and open information extraction (OpenIE) models (Angeli et al., 2015). Note that we do not conduct global entity linking across documents. Next, we take the ⟨subject, predicate, object⟩ triples extracted by OpenIE and remove any triple whose argument (subject or object) has more than 10 words. If two triples differ only by one argument and the arguments overlap, we keep the longer triple. We begin constructing the graph by treating subjects and objects as nodes connected by directed edges, with predicates as attributes. We further collapse coreferential mentions of the same entity into one node. With this, we can localize salient content related to each entity and connect spread-out entities through graph paths.
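As an illustration, the filtering and node-merging steps above can be sketched as follows. All function and variable names are our own; the OpenIE and coreference calls themselves, as well as the near-duplicate-triple filter, are omitted and assumed to run beforehand:

```python
# Illustrative sketch of the graph-construction steps described above:
# triple length filtering, node creation, and coreference collapsing.
# OpenIE triples and coreference clusters are assumed precomputed.

def build_graph(triples, coref_clusters, max_arg_words=10):
    """triples: list of (subject, predicate, object) strings.
    coref_clusters: list of sets of mentions referring to one entity."""
    # 1) Drop triples whose subject or object exceeds 10 words.
    triples = [t for t in triples
               if len(t[0].split()) <= max_arg_words
               and len(t[2].split()) <= max_arg_words]

    # 2) Collapse coreferential mentions into one canonical node name.
    canon = {}
    for cluster in coref_clusters:
        rep = max(cluster, key=len)      # longest mention as representative
        for mention in cluster:
            canon[mention] = rep

    # 3) Subjects/objects become nodes; predicates label directed edges.
    nodes, edges = set(), []
    for subj, pred, obj in triples:
        s, o = canon.get(subj, subj), canon.get(obj, obj)
        nodes.update([s, o])
        edges.append((s, pred, o))
    return nodes, edges
```
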

Summarization Model
In this section, we describe our graph-augmented abstractive summarization framework, as displayed in Fig. 2. Our model takes as input a document, represented as a sequence of tokens x = {x_k}, and a knowledge graph G consisting of nodes {v_i}. x and G are separately consumed by a document encoder and a graph encoder, as presented in §4.1. Importantly, we present two types of graphs: DOCGRAPH, focusing on the global context, and SEGGRAPH, which additionally captures topic shift. The summary decoder then generates an abstractive summary by attending to both the document and the graph (§4.2). In §4.3, we formulate a maximum likelihood training objective that leverages the detection of salient nodes in the graph.

Encoders
Document Encoder. We first feed input x to RoBERTa (Liu et al., 2019) and take the last layer output as token embeddings. We then employ a single-layer bidirectional LSTM (BiLSTM) over the token embeddings, producing encoder hidden states h_k at time step k.
Graph Encoder. Built on the graph constructed in §3, we create nodes for predicates, as done in previous graph-to-sequence work (Beck et al., 2018), to reduce model parameters. Directed, unlabeled edges are added from subject to predicate, and from predicate to object. We further add reverse edges and self-loops to enhance the information flow, and this forms the graph G.
Node Initialization. Each node often contains multiple mentions of an entity; we thus initialize the node representation v_i using the average embedding of its tokens. We leverage document encoder hidden states h_k as the contextual representations of tokens. The number of mentions in the node is added as an extra encoding to v_i to signify entity salience.
Contextualized Node Encoding. Our graph encoder improves upon Graph Attention Networks (GATs) (Veličković et al., 2018) by adding residual connections between layers, as discussed in Koncel-Kedziorski et al. (2019). Each node v_i is represented by a weighted average of its neighbors plus a residual connection:

v̂_i = v_i + ‖_{n=1}^{N} Σ_{v_j ∈ N(v_i)} α_{i,j}^n W_{0,n} v_j    (1)

where ‖_{n=1}^{N} denotes the concatenation of N heads, each producing a vector of the same dimension as v_i, and α_{i,j}^n is the attention weight between v_i and v_j computed by the n-th head. We use N = 4 in our experiments with two layers of GATs. N(v_i) denotes the neighbors of v_i in graph G. W_* are trainable parameters.
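A minimal, dependency-free sketch of one such multi-head attention layer with the residual connection follows. The per-head parameters `W0` (value projection) and `a` (scoring vector) are illustrative stand-ins for the trainable weights W_*, and the scoring function is the standard GAT form rather than the paper's exact parameterization:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def gat_layer(h, adj, heads):
    """One graph-attention layer with a residual connection (cf. Eq. 1).
    h: {node: vector}; adj: {node: neighbors, incl. self-loop};
    heads: list of (W0, a) pairs -- per-head value projection and
    scoring vector (illustrative stand-ins for W_*)."""
    out = {}
    for i in adj:
        vi = h[i]
        head_outs = []
        for W0, a in heads:
            pi = matvec(W0, vi)
            vals, scores = [], []
            for j in adj[i]:
                vj = matvec(W0, h[j])
                vals.append(vj)
                s = sum(x * y for x, y in zip(a, pi + vj))  # GAT-style score
                scores.append(s if s > 0 else 0.2 * s)      # LeakyReLU
            alphas = softmax(scores)                        # attention weights
            head_outs.append([sum(al * v[k] for al, v in zip(alphas, vals))
                              for k in range(len(pi))])
        concat = [x for ho in head_outs for x in ho]        # N-head concat
        out[i] = [x + y for x, y in zip(vi, concat)]        # residual add
    return out
```

Note that the residual add requires the concatenated head outputs to match the input node dimension, which is why each of the N heads projects to dim/N.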
The graph encoder described above encodes document-level global context by merging entity mentions throughout the document and capturing their interactions via graph paths. It is henceforth denoted as DOCGRAPH.
Encoder Extension to Capture Topic Shift (SEGGRAPH). Modeling topic transitions and recurrences enables the identification of notable content, thus benefiting summarization (Barzilay and Lee, 2004). Since paragraphs naturally divide a document into different topic segments, we extend DocGraph by first encoding each paragraph as a subgraph G_p (for the p-th paragraph) using the same graph encoder, and then connecting all subgraphs with a BiLSTM. If two nodes in separate subgraphs refer to the same entity, they are initialized with the same embedding (as in the first occurrence). Concretely, we first apply max-pooling over all nodes in subgraph G_p from the outputs of the final GAT layer; the max-pooling results are then used as inputs to a BiLSTM to produce the final subgraph representation h^g_p for G_p.
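The per-subgraph pooling step above can be illustrated as follows (a sketch only: the BiLSTM that consumes the pooled vectors to produce h^g_p is omitted):

```python
def subgraph_repr(node_vecs):
    """Element-wise max-pooling over the final-layer node vectors of one
    paragraph subgraph G_p. The BiLSTM over the resulting subgraph
    vectors, which yields h^g_p, is omitted from this sketch."""
    return [max(col) for col in zip(*node_vecs)]
```
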

Summary Decoder
Our summary decoder uses a single-layer unidirectional LSTM with a hidden state s t at step t; it generates summary tokens recurrently by jointly attending to the input document and the graph.
Attending the Graph. At each decoding step t, we compute a graph context vector c^v_t with the attention mechanism (Bahdanau et al., 2014), where attention weights a^v_t over the nodes are obtained from the decoder state s_t and node representations v_i (Eq. 3). u_* are also trainable parameters; we omit bias terms for simplicity.

Attending the Document. Similarly, the document context c_t is computed over input tokens by additionally considering the graph context c^v_t.

Token Prediction. Graph and document context vectors, treated as salient content summarized from both sources, are concatenated with the decoder hidden state s_t to produce the vocabulary distribution P_vocab. We use weight-sharing between the input embedding matrix and the output matrix W_out to allow reusing linguistic knowledge, as proposed by Paulus et al. (2018). We further add a copy mechanism similar to See et al. (2017), with the copy probability computed from the decoder state, the context vectors, and y_{t−1}, the embedding of the token predicted at step t−1.

Modified Hierarchical Attention for SegGraph. As mentioned in §4.1, SegGraph captures content salience by modeling topic shift across paragraphs. We thus seek to leverage paragraph-level importance to redistribute the node attentions, e.g., giving more attention to nodes in important paragraphs.

In particular, we utilize hierarchical attention (Hsu et al., 2018): we first calculate attention a^g_t over subgraphs as done in Eq. 3, replacing v_i with the subgraph representation h^g_p. We then combine the subgraph attentions a^g_t with the previously calculated node attentions a^v_t using scalar multiplication and renormalization over all nodes in the input. This yields the new attention weights â^v_t, which are used to obtain the graph context vector c^v_t as done in Eq. 3 for SegGraph.
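The reweighting-and-renormalization step above can be sketched as follows (function and argument names are ours):

```python
def hierarchical_node_attention(node_attn, node2para, para_attn):
    """Rescale each node's attention a^v_t by the attention a^g_t of its
    paragraph (subgraph), then renormalize over all nodes, yielding the
    modified weights used to build the graph context vector."""
    scaled = [a * para_attn[node2para[i]] for i, a in enumerate(node_attn)]
    z = sum(scaled)
    return [s / z for s in scaled]
```

For example, two nodes that originally share attention equally end up weighted by their paragraphs' importance after renormalization.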

Training Objectives
We first consider a maximum likelihood (ML) training objective that minimizes the following loss:

L_seq = − Σ_{(x,y)∈D} log P(y | x; θ)

where x are documents and y are references from the training set D, and θ are the model parameters.

Node Salience Labeling. In addition to modeling local characteristics of nodes, we further enhance the model by adding an objective to label node salience, e.g., whether the entities in a node are mentioned in the reference summary. We introduce a soft mask layer over each node before it is passed into the graph encoder, to signify its salience. This layer, serving as an information gate, predicts a real number m̂_i in [0, 1] for each node v_i and multiplies it with the node representation, i.e., m̂_i v_i. For node v_i, the mask is calculated as m̂_i = sigmoid(u_2 v_i). During training, the gold-standard mask m_i for a node is set to 1 if it contains at least one content word in the reference summary, and 0 otherwise. We add a cross-entropy objective L_mask over all N_v nodes in the dataset D, where N_v represents the number of nodes in the dataset. Finally, the ML training objective takes the form L_ml = L_mask + L_seq.
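A sketch of the node-salience objective follows, assuming a standard binary cross-entropy form between the predicted soft masks m̂_i and the gold masks m_i (the excerpt does not spell out the exact loss, so cross-entropy is our assumption):

```python
import math

def mask_loss(pred_masks, gold_masks):
    """L_mask sketch: average binary cross-entropy over N_v nodes between
    predicted soft masks (in [0, 1]) and binary gold masks. The exact
    form of L_mask is assumed, not quoted from the text."""
    eps = 1e-12  # numerical guard against log(0)
    total = 0.0
    for m_hat, m in zip(pred_masks, gold_masks):
        total += -(m * math.log(m_hat + eps)
                   + (1 - m) * math.log(1 - m_hat + eps))
    return total / len(pred_masks)
```
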

Reinforcement Learning with Cloze
After maximum likelihood training with L_ml, we further apply a multiple choice cloze reward in a second-stage reinforcement learning (RL) step, leading the model to generate more faithful and informative summaries.
For RL, we use a self-critical policy gradient algorithm (Rennie et al., 2017). During training, two summaries are generated: a sampled summary y^s, whose tokens are drawn from the probability distribution p(y^s_t | y^s_{1:t−1}, x; θ) at each decoding step, and a baseline summary ŷ, which greedily selects the highest-probability token at each step. The RL objective is defined based on the rewards of the two summaries, R(y^s) and R(ŷ):

L_rl = (R(ŷ) − R(y^s)) Σ_t log p(y^s_t | y^s_{1:t−1}, x; θ)

Our reward function combines ROUGE and the multiple choice cloze score introduced below, i.e., R(y) = R_rouge(y) + γ_cloze R_cloze. R_rouge considers the F1 scores of ROUGE-1, ROUGE-2, and ROUGE-L calculated against the reference summary, weighted by γ_1 and γ_2.

Multiple Choice Cloze Reward. Here, we present a novel multiple choice cloze reward to work with our knowledge graph and guide the summarization model towards improved awareness of entity interactions. We treat the system-generated summary as context. We provide a set of questions automatically constructed from the corresponding reference summary written by a human. We separately train a question answering (QA) model to answer the questions by reading the context. Intuitively, if the system summary shares salient information with the reference, the QA model will assign the correct answers high probability. We use the average probability of the correct answers as our cloze reward. Below, we give details on how to construct the questions and candidate answers, with examples shown in Fig. 3.
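The reward combination and the self-critical loss can be sketched as follows. The convex weighting of the three ROUGE F1 scores by γ_1, γ_2, and 1 − γ_1 − γ_2 is an assumption on our part, chosen to be consistent with the γ settings reported in the experiments:

```python
def combined_reward(rouge1, rouge2, rougeL, cloze, g1, g2, g_cloze):
    """R(y) = R_rouge(y) + gamma_cloze * R_cloze. R_rouge is assumed to be
    a convex combination of the three ROUGE F1 scores weighted by
    gamma_1, gamma_2, and (1 - gamma_1 - gamma_2)."""
    r_rouge = g1 * rouge1 + g2 * rouge2 + (1 - g1 - g2) * rougeL
    return r_rouge + g_cloze * cloze

def self_critical_loss(sample_log_probs, r_sample, r_baseline):
    """L_rl = (R(y_hat) - R(y^s)) * sum_t log p(y^s_t | ...): minimizing
    this raises the likelihood of sampled summaries that beat the
    greedy baseline."""
    return (r_baseline - r_sample) * sum(sample_log_probs)
```

With the NYT settings (γ_1 = 0, γ_2 = 0.75, γ_cloze = 0.05), a summary with perfect ROUGE and cloze scores receives a reward of 1.05 under this assumed weighting.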
Question Construction. We run the OpenIE tool on human-written summaries, retaining triples whose arguments are no longer than 5 words. For each ⟨subject, predicate, object⟩ triple, we create two types of questions: (1) argument pair questions, by removing the subject and object, and (2) predicate questions, by removing the predicate.

Candidate Answer Construction. Because fill-in-the-blank style cloze tests may incorrectly penalize QA systems for answers paraphrased from the ground truth, we opt for a multiple choice cloze. We construct three candidate answers in addition to the gold-standard from the salient context, which consists of summary-worthy sentences selected from the input. Specifically, we use greedy search to select the combination of sentences that maximizes ROUGE-2 F1 with respect to the human summary. We further include a sentence in the salient context if it has a ROUGE-L recall greater than 0.6 when compared with any sentence in the reference.
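The greedy salient-context selection can be sketched as below. We substitute a simplified, set-based bigram overlap for ROUGE-2 F1 (a real implementation would use a proper ROUGE package), and all names are illustrative:

```python
def bigrams(tokens):
    return {(a, b) for a, b in zip(tokens, tokens[1:])}

def rouge2_f1(cand_tokens, ref_tokens):
    # Set-based bigram overlap: a simplified stand-in for ROUGE-2 F1.
    cb, rb = bigrams(cand_tokens), bigrams(ref_tokens)
    if not cb or not rb:
        return 0.0
    overlap = len(cb & rb)
    if overlap == 0:
        return 0.0
    p, r = overlap / len(cb), overlap / len(rb)
    return 2 * p * r / (p + r)

def greedy_salient_context(sentences, reference):
    """Greedily add the input sentence that most improves (simplified)
    ROUGE-2 F1 against the reference summary; stop when no sentence
    improves the score."""
    ref = reference.split()
    chosen, best = [], 0.0
    while len(chosen) < len(sentences):
        scored = [(rouge2_f1(" ".join(sentences[j]
                                      for j in chosen + [i]).split(), ref), i)
                  for i in range(len(sentences)) if i not in chosen]
        score, i = max(scored)
        if score <= best:
            break
        best, chosen = score, chosen + [i]
    return [sentences[i] for i in chosen]
```
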
We first select OpenIE triples from the salient context and filter out those that have any content word overlapping with the correct answer. For argument pair questions, we create one candidate answer by swapping the subject and the object (e.g., candidate B in Fig. 3) and two candidates by replacing the subject or the object with another argument of the same role extracted from the salient context (e.g., candidates C and D). If not enough answers can be created this way, we further consider randomly selecting sentences from the input. For predicate questions, we use predicates in other triples from the context as candidate answers. Among all candidates, we select the three that construct the most fluent questions, as measured by perplexity predicted by BERT (Devlin et al., 2019).
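The distractor construction for argument pair questions can be sketched as follows. The content-word overlap filtering and the BERT-perplexity ranking are assumed to happen outside this function, and the names are ours:

```python
def argument_pair_distractors(gold_triple, context_triples):
    """Build candidate (subject, object) distractors for an argument-pair
    question: one by swapping subject and object, the others by
    substituting an argument of the same role taken from salient-context
    triples (assumed pre-filtered for overlap with the gold answer)."""
    subj, pred, obj = gold_triple
    cands = [(obj, subj)]                   # swap subject and object
    for c_subj, _, c_obj in context_triples:
        cands.append((c_subj, obj))         # replace the subject
        cands.append((subj, c_obj))         # replace the object
    seen, out = {(subj, obj)}, []
    for c in cands:                         # dedupe; drop the gold pair
        if c not in seen:
            seen.add(c)
            out.append(c)
    return out[:3]                          # keep three distractors
```
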
In case reference summaries do not yield OpenIE triples, we create additional entity pair questions: we remove two co-occurring entities from the summary and create three candidate answers in the same way as described above.

QA Model. We fine-tune RoBERTa (Liu et al., 2019) to build our QA model, using the salient context described above as the context for training. We concatenate the context, the question, and each of the four candidate answers, and pass the final [CLS] representation through a fully-connected layer, from which the answer is predicted.

Experimental Setups
Datasets. We experiment with two popular summarization datasets whose summaries contain multiple sentences: the New York Times annotated corpus (NYT) (Sandhaus, 2008) and the CNN/Daily Mail dataset (CNN/DM) (Hermann et al., 2015). We follow the preprocessing steps and experimental setups from prior work (Paulus et al., 2018; See et al., 2017) for both datasets. For NYT, the training, validation, and test sets contain 588,909, 32,716, and 32,703 samples. For CNN/DM, the numbers are 287,188, 13,367, and 11,490. To train our cloze QA model for NYT, we construct 1,414,336 question-answer pairs from human-written summaries in the training set based on the method described in §5. On CNN/DM, we collect 1,361,175 question-answer samples from the training set. For both datasets, we set aside 20,000 samples as a validation set and 20,000 samples as a test set. Our QA model achieves an accuracy of 97% on NYT and 95% on CNN/DM.

Training Details and Parameters. We use the base version of the RoBERTa model to extract token features in all experiments. We truncate input articles to 1024 (NYT) and 512 (CNN/DM) BPE tokens. We employ LSTM models with 256-dimensional hidden states for the document encoder (128 per direction) and the decoder. For the graph encoder with residual connections, we use 4 heads, each with a dimension of 72. For DocGraph training and inference, we prune isolated graphs with fewer than three nodes to increase robustness and reduce redundancy. We set γ_1 = 0, γ_2 = 0.75 on NYT and γ_1 = 0.33, γ_2 = 0.33 on CNN/DM after tuning on the validation set. For both datasets, we set γ_cloze = 0.05.

Baselines and Comparisons. For both datasets, we include an extractive baseline, LEAD-3. We further add the following abstractive models for comparison: (1) a pointer-generator model with coverage (See et al., 2017); and (2) deep communicating agents (DCA) (Celikyilmaz et al., 2018). We also report results from fine-tuning the BART model (Lewis et al., 2019). In Lewis et al. (2019), fine-tuning is only performed on CNN/Daily Mail; we apply the same method to NYT.
For NYT, we add results by SENECA model (Sharma et al., 2019) from our prior work, which previously achieved the best ROUGE-2.
On CNN/Daily Mail, we include comparisons with a two-stage fine-tuned model (first on an extractor, then on an abstractor) based on BERT (Liu and Lapata, 2019) (BERTSUMEXTABS), and with a unified pretrained language model for generation (Dong et al., 2019) (UNILM).
In addition to ASGARD-DOC and ASGARD-SEG, which are trained with an ML objective, we report results trained with ROUGE as the reward (R_rouge) and with an additional cloze reward (R_cloze). Lastly, we consider a variant NOGRAPH by ablating the graph encoder.

Results on NYT. As shown in Table 1, our models obtain significantly higher ROUGE scores (Lin and Hovy, 2003) than all other comparisons except the fine-tuned BART, and ASGARD-SEG's ROUGE-L score is comparable to BART. This indicates the effectiveness of our graph-augmented summarization framework. Moreover, both our ASGARD-DOC and ASGARD-SEG models yield significantly higher ROUGE scores than the variant without the graph encoder (NOGRAPH), demonstrating the benefit of using structured representation to encode entity interactions. Furthermore, both ASGARD-DOC and ASGARD-SEG with the cloze reward (R_cloze) obtain significantly higher scores than the models trained with the ROUGE reward only. This signifies that our multiple choice cloze reward can guide better semantic interpretation of content, leading to the generation of more informative summaries. We also find that ASGARD-SEG outperforms ASGARD-DOC, indicating that ASGARD-SEG better captures topic drift across multiple paragraphs.
Results on CNN/DM. We observe similar trends on the CNN/DM articles, as shown in Table 2. Noticeably, ASGARD-DOC trained with the combined ROUGE and cloze reward produces better ROUGE scores than BERTSUMEXTABS and UNILM, which are carefully fine-tuned from large pretrained language models, and its numbers are also comparable to those of the fine-tuned BART.
Evaluation with Cloze Test. We further evaluate model-generated summaries with our proposed cloze test. We report two scores in Fig. 4: the average probability of the correct answers output by our QA model, and its prediction accuracy. We first calculate one score per summary, then average over all summaries. Our models with graph encoders perform better than the variant without one.

Human Evaluation
We further conduct human evaluation to analyze the informativeness and fluency of the generated summaries, as well as to investigate the unfaithful errors made by different models. We sample 100 articles from the NYT test set and hire three native or fluent speakers of English to rate summaries generated by our two systems, NOGRAPH+R_rouge and ASGARD-SEG+R_rouge+R_cloze, along with outputs by BART and human-written summaries (presented in random order). After reading the articles, each judge scores summaries on a Likert scale from 1 (worst) to 5 (best) on informativeness (whether the summary covers important information from the input) and fluency (whether the summary is grammatically correct). We consider three types of unfaithful errors: (i) hallucination errors, i.e., creating content not present in the input; (ii) out-of-context errors, i.e., generating facts without the required context or within incorrect context; and (iii) deletion or substitution errors, i.e., mistakenly deleting or substituting subjects, objects, or clauses. We ask the annotators to label each type as 1 if such errors exist, and 0 otherwise. Detailed guidelines are in the Appendices.
From Table 3, we can see that our ASGARD-SEG model obtains better scores in informativeness and fluency than the variant without the graph encoder, indicating the effectiveness of leveraging the knowledge graph representation. Sample output summaries by our models can be found in Fig. 5. Meanwhile, the fine-tuned BART model produces outputs with informativeness and fluency similar to those of human-written summaries, suggesting a future direction of building our model on top of a large pretrained encoder-decoder model.
For unfaithful errors, we report the percentage of errors determined by majority voting (i.e., more than one annotator labels the summary as incorrect). First, we find that our ASGARD-SEG model has an error pattern comparable to that of human summaries. Specifically, for out-of-context and deletion or substitution errors, our graph-enhanced model produces significantly fewer mistakes than the model without graph information. This implies that knowledge graph-enhanced models can improve summary faithfulness.
Interestingly, human-written summaries are also found to contain a nontrivial amount of hallucination errors. After inspection, we find that humans tend to leverage world knowledge to include content that is not covered by the article. For instance, for an article discussing events in "Boston", the human writer may describe them as happening in "Massachusetts" in the summary.

Analyzing Automatic Metrics and Summary Errors
We further plot the distributions of automatic evaluation scores with regard to the three types of unfaithful errors, based on majority voting, in Fig. 6. First, summaries with out-of-context and deletion or substitution errors receive lower cloze and ROUGE scores overall.
Nevertheless, with regard to hallucination errors, we do not see such a pattern; there is even a slightly reversed relation with both cloze and ROUGE scores, wherein summaries with more hallucination errors tend to score higher. This echoes our previous observation that human summaries can be hallucinatory too, when world knowledge is used to write them. Furthermore, we find only weak correlations between the three variants of ROUGE scores and the three types of errors; e.g., the minimum and maximum values of Pearson's r are −0.19 and 0.14. This suggests that new metrics should be designed to better gauge summary quality. We plan to study this direction in future work.

Conclusion
We presented a novel knowledge graph-augmented abstractive summarization framework, along with a multiple choice cloze reward for reinforcement learning. Our models capture both local characteristics and global interactions of entities from the input, thus generating summaries of higher quality. In tandem with the graph representation, our cloze reward further improves summary content.
Human evaluation further confirms that our graph-augmented models trained with the cloze reward produce more informative summaries and significantly reduce unfaithful errors.

A.1 Implementation Details

We use the base version of the BERT model (Devlin et al., 2019) to select candidate answers, and we fine-tune the base version of the RoBERTa model (Liu et al., 2019) to build our QA model. We take pretrained models from Wolf et al. (2019).

A.2 Human Evaluation Guideline
In our human evaluation, each annotator is presented with 100 news articles. The annotators are asked to evaluate four summaries (in random order) for each article on two aspects, informativeness and fluency, on a scale of 1 to 5 (1 being very poor and 5 being very good). Furthermore, for faithfulness, we define three types of unfaithful errors and ask annotators to label whether summaries contain each type of error. The instructions in Table 5 are given to the human judges.
Here are descriptions of the aspects: • Informativeness: Whether the summary provides enough and necessary content coverage from the input article.
• Fluency: Whether the summary is free of obvious grammatically incorrect sentences (e.g., fragments, missing components) that make the text difficult to read.
• Faithfulness: Whether the summary accords with the facts expressed in the source.

Figure 1 :
Figure 1: Sample knowledge graph constructed from an article snippet. The graph localizes relevant information for entities (color-coded, e.g., "John M. Fabrizi") or events (underlined) and provides global context.

Figure 3 :
Figure 3: Sample construction of multiple choice cloze questions and candidate answers from the reference summary and salient context. Arguments and predicates in candidate answers are color-coded and italicized.

Figure 5 :
Figure 5: Sample summaries for an NYT article. Summaries by our models with the graph encoder are more informative than those of the variant without it.

Figure 6 :
Figure 6: Distribution of automatic summarization metrics for the three types of unfaithful errors. "True" indicates summaries with the given type of error.

More details about parameters and graph statistics are in the Appendices.

Table 1 :
Automatic evaluation with ROUGE on New York Times. Best results are in boldface. Best results among our models are in italics. ASGARD-SEG+R_rouge+R_cloze yields significantly higher scores than our other models under an approximate randomization test (p < 0.0005).

Table 2 :
Automatic evaluation with ROUGE on CNN/Daily Mail. Best results of our model variants are in italics. Both ASGARD-SEG+R_rouge+R_cloze and ASGARD-DOC+R_rouge+R_cloze obtain significantly better scores than the other model variants (p < 0.0005).

Table 3 :
Human evaluation results on informativeness, fluency, and unfaithful errors for NYT samples.
Sample summaries for an NYT article (Fig. 5):
Lesbian couple in South Jersey wins court approval to have both of their names listed as parents on birth certificate of their newborn; it will no longer oppose such applications. [ASGARD-DOC+R_rouge+R_cloze]
Lesbian couple in South Jersey won court approval to have both of their names listed as parents on birth certificate of their newborn; attorney general's office says it will no longer oppose such applications. [ASGARD-SEG+R_rouge+R_cloze]

Statistics of Knowledge Graphs. We show the statistics of knowledge graphs on the two datasets in Table 4. On each dataset, we construct a large graph with abundant relations for each article. Note that on CNN/DM we have more arguments but fewer predicates per document than on NYT, indicating that CNN/DM has fewer coreferred entities.

Training Details. We utilize Adam (Kingma and Ba, 2015) with a gradient clipping of 2.0 and a batch size of 32 for all models. During ML training, a learning rate of 0.001 is used; during the RL stage, it is reduced to 0.0001 (Paulus et al., 2018).

Table 4 :
Statistics of NYT and CNN/DM datasets. #Arg.: number of arguments in each document or paragraph. #Pre.: number of predicates in each document or paragraph. #Para.: number of paragraphs in each document. The two datasets have comparable graph sizes.

                  Document              Paragraph
         #Words   #Arg.   #Pre.    #Arg.   #Pre.    #Para.
NYT      795.9    131.6   87.3     6.40    3.74     23.5
CNN/DM   789.9    138.1   85.2     6.30    3.57     24.2