How to Write Summaries with Patterns? Learning towards Abstractive Summarization through Prototype Editing

Under special circumstances, summaries should conform to a particular style with patterns, as in court judgments and abstracts of academic papers. To this end, prototype document-summary pairs can be utilized to generate better summaries. There are two main challenges in this task: (1) the model needs to incorporate learned patterns from the prototype, but (2) it should avoid copying content other than the patternized words, such as irrelevant facts, into the generated summaries. To tackle these challenges, we design a model named Prototype Editing based Summary Generator (PESG). PESG first learns summary patterns and prototype facts by analyzing the correlation between a prototype document and its summary. Prototype facts are then utilized to help extract facts from the input document. Next, an editing generator generates a new summary based on the summary pattern and the extracted facts. Finally, to address the second challenge, a fact checker is used to estimate the mutual information between the input document and the generated summary, providing an additional signal for the generator. Extensive experiments conducted on a large-scale real-world text summarization dataset show that PESG achieves state-of-the-art performance in terms of both automatic metrics and human evaluations.


Introduction
Abstractive summarization can be regarded as a sequence mapping task that maps the source text to the target summary (Rush et al., 2015; Li et al., 2017; Cao et al., 2018; Gao et al., 2019a). It has drawn significant attention since the introduction of deep neural networks to natural language processing. Under special circumstances, the generated summaries are required to conform to a specific pattern, such as court judgments, diagnosis certificates, abstracts in academic papers, etc. Taking court judgments as an example, there is always a statement of the crime committed by the accused, followed by the motives and the results of the judgment. An example case is shown in Table 1, where the summary shares the same writing style and has words in common with the prototype summary (retrieved from the training dataset).

Table 1:
Prototype summary: The court held that the defendant Wang had stolen the property of others for the purpose of illegal possession. The amount was large, and his behavior constituted the crime of theft. The accusation of the public prosecution agency was established. The defendant Wang has a criminal record and will be considered when sentencing. Since the defendant Wang did not succeed because of reasons other than his will, he could be punished lightly. After the defendant confessed his crimes to the case, he was given a lighter punishment according to law.
Summary: The court held that the accused Zhang and Fan stole property and the amount was large. Their actions constituted the crime of theft. The accusation of the public prosecution agency was established and supported. This crime was committed within two years after the release of the defendants Zhang and Fan. Thus they are recidivists and this situation will be considered when sentencing. The fact that defendants Zhang and Fan surrendered themselves and pleaded guilty in court gives a lighter punishment according to law.

* Equal contribution. Ordering is decided by a coin flip.
† Corresponding author.
1 https://github.com/gsh199449/proto-summ
Existing prototype-based generation models are all applied to short text and thus cannot handle the long-document summarization task. Another series of works focuses on template-based methods such as (Oya et al., 2014). However, template-based methods are too rigid for our patternized summary generation task. Hence, in this paper, we propose a summarization framework named Prototype Editing based Summary Generator (PESG) that incorporates prototype document-summary pairs to improve summarization performance when generating summaries with patterns. First, we calculate the cross dependency between the prototype document-summary pair to obtain a summary pattern and prototype facts (explained in § 4.2). Then, we extract facts from the input document with the help of the prototype facts (explained in § 4.3). Next, a recurrent neural network (RNN) based decoder is used to generate a new summary, incorporating both the summary pattern and the extracted facts (explained in § 4.4). Finally, a fact checker is designed to provide mutual information between the generated summary and the input document to prevent the generator from copying irrelevant facts from the prototype (explained in § 4.5). To evaluate PESG, we collect a large-scale court judgment dataset, where each judgment is a summary of the case description with a patternized style. Extensive experiments conducted on this dataset show that PESG outperforms the state-of-the-art summarization baselines in terms of ROUGE metrics and human evaluations by a large margin.
Our contributions can be summarized as follows:
• We propose to use prototype information to help generate better summaries with patterns.
• Specifically, we propose to generate the summary incorporating the prototype summary pattern and extracted facts from input document.
• We provide mutual information signal for the generator to prevent copying irrelevant facts from the prototype.
• We release a large-scale prototype based summarization dataset that is beneficial for the community.

Related Work
We detail related work on text summarization and prototype editing.
Text summarization can be classified into extractive and abstractive methods. Extractive methods (Narayan et al., 2018b; Chen et al., 2018) directly select salient sentences from an article to compose a summary. One shortcoming of these models is that they tend to suffer from redundancy. Recently, with the emergence of neural network models for text generation, a vast majority of the literature on summarization (Ma et al., 2018; Gao et al., 2019a; Chen et al., 2019) is dedicated to abstractive summarization, which aims to generate new content that concisely paraphrases a document from scratch.
Another line of research focuses on prototype editing. Guu et al. (2018) proposed the first prototype editing model, which samples a prototype sentence from training data and then edits it into a new sentence. Following this work, a new paradigm for response generation was proposed, which first retrieves a prototype response from a pre-defined index and then edits the prototype response. Cao et al. (2018) applied this method to summarization, employing existing summaries as soft templates to generate a new summary without modeling the dependency between the prototype document, the prototype summary and the input document. Different from these soft attention methods, Cai et al. (2018) proposed a hard-editing skeleton-based model to promote the coherence of generated stories. Template-based summarization is also a hard-editing method (Oya et al., 2014), where a multi-sentence fusion algorithm is extended in order to generate summary templates.
Unlike all of the above work, our model focuses on patternized summary generation, which is more challenging than traditional news summarization and short-sentence prototype editing.
For a given document $X$, our model extracts salient facts from $X$ guided by a prototype document $\hat{X}$, and then generates the summary $Y$ by referring to the prototype summary $\hat{Y}$. The goal is to generate a summary $Y$ that not only follows a patternized style (as defined by the prototype summary $\hat{Y}$) but is also consistent with the facts in document $X$.

Overview
In this section, we propose our prototype editing based summary generator, which can be split into two main parts, as shown in Figure 1:
• Summary Generator.
(1) Prototype Reader module analyzes the dependency between the prototype document and the prototype summary to determine the summary pattern and prototype facts.
(2) Fact Extraction module extracts facts from the input document under the guidance of the prototype facts.
(3) Editing Generator module generates the summary $Y$ of document $X$ by incorporating the summary pattern and facts.
• Fact Checker estimates the mutual information between the generated summary $Y$ and the input document $X$. This information provides an additional signal for the generation process, preventing irrelevant facts from being copied from the prototype document.

Prototype Reader
To begin with, we use an embedding matrix $e$ to map the one-hot representation of each word in $X$, $\hat{X}$ and $\hat{Y}$ into a high-dimensional vector space. We then employ a bi-directional recurrent neural network (Bi-RNN) to model the temporal interactions between words, where $h^x_t$, $\hat{h}^x_t$ and $\hat{h}^y_t$ denote the hidden state at the $t$-th step of the Bi-RNN for $X$, $\hat{X}$ and $\hat{Y}$, respectively. Following (Gao et al., 2019b; Hu et al., 2019), we choose long short-term memory (LSTM) as the cell for the Bi-RNN.
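As a sketch consistent with the definitions above, the three encoder recurrences can be written as follows. The function names $\text{Bi-RNN}_x$, $\text{Bi-RNN}_{\hat{x}}$ and $\text{Bi-RNN}_{\hat{y}}$ are assumed notation, and whether the three encoders share parameters is not stated here.

```latex
\begin{aligned}
h^x_t &= \text{Bi-RNN}_x\big(e(x_t),\, h^x_{t-1}\big),\\
\hat{h}^x_t &= \text{Bi-RNN}_{\hat{x}}\big(e(\hat{x}_t),\, \hat{h}^x_{t-1}\big),\\
\hat{h}^y_t &= \text{Bi-RNN}_{\hat{y}}\big(e(\hat{y}_t),\, \hat{h}^y_{t-1}\big).
\end{aligned}
```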
On one hand, the sections in the prototype summary that are not highly related to the prototype document are the universal patternized words and should be emphasized when generating the new summary. On the other hand, the sections in the prototype document that are highly related to the prototype summary are useful facts that can guide the process of extracting facts from the input document. Hence, we employ a bi-directional attention mechanism between the prototype document and summary to analyze the cross-dependency, that is, from document to summary and from summary to document. Both of these attentions are derived from a shared similarity matrix, $S \in \mathbb{R}^{T_m \times T_n}$, calculated from the hidden states of the prototype document $\hat{X}$ and the prototype summary $\hat{Y}$. $S_{ij}$ indicates the similarity between the $i$-th document word $\hat{x}_i$ and the $j$-th summary word $\hat{y}_j$ and is computed as $S_{ij} = \alpha(\hat{h}^x_i, \hat{h}^y_j)$ (4), where $\alpha$ is a trainable scalar function that calculates the similarity between two input vectors, $\oplus$ denotes a concatenation operation and $\otimes$ is an element-wise multiplication.
We use $a^s_t = \mathrm{mean}(S_{:t}) \in \mathbb{R}$ to represent the attention weight on the $t$-th prototype summary word by the document words, which will learn to assign high weights to highly related universal patternized words when generating a summary. From $a^s_t$, we obtain the weighted hidden states of the prototype summary as the "summary pattern" $l = \{l_1, \dots, l_{T_n}\}$, where $l_i = a^s_i \hat{h}^y_i$. Similarly, $a^d_t = \mathrm{mean}(S_{t:}) \in \mathbb{R}$ assigns high weights to the words in the prototype document that are relevant to the prototype summary. A convolutional layer is then applied to extract "prototype facts" $\hat{r}_t$ from the prototype document: $\hat{r}_t = \mathrm{CNN}(a^d_t \hat{h}^x_t)$ (6). We sum the prototype facts to obtain the overall representation of these facts: $q = \sum_{t} \hat{r}_t$ (7).
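The cross-dependency computation above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the concrete form of the similarity function `alpha` (a weight vector `w` applied to the concatenation of both vectors and their element-wise product) is an assumption suggested by the ⊕ and ⊗ operators, and all function names are hypothetical.

```python
def alpha(x, y, w):
    """Assumed similarity: w . [x (+) y (+) (x (*) y)], a scalar score."""
    feats = list(x) + list(y) + [a * b for a, b in zip(x, y)]
    return sum(wi * fi for wi, fi in zip(w, feats))

def similarity_matrix(doc_states, sum_states, w):
    """S[i][j]: prototype-document word i versus prototype-summary word j."""
    return [[alpha(hx, hy, w) for hy in sum_states] for hx in doc_states]

def column_means(S):
    """a^s_t = mean(S[:, t]): weight on summary word t from all document words."""
    rows = len(S)
    return [sum(S[i][t] for i in range(rows)) / rows for t in range(len(S[0]))]

def row_means(S):
    """a^d_t = mean(S[t, :]): weight on document word t from all summary words."""
    return [sum(row) / len(row) for row in S]

def summary_pattern(sum_states, a_s):
    """l_i = a^s_i * h^y_i: summary hidden states scaled by their weights."""
    return [[a * v for v in h] for a, h in zip(a_s, sum_states)]

# Toy hidden states: 3 document words, 2 summary words, dimension 2.
doc_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
sum_states = [[1.0, 0.0], [0.0, 2.0]]
w = [1.0] * 6  # one weight per concatenated feature

S = similarity_matrix(doc_states, sum_states, w)
a_s = column_means(S)
a_d = row_means(S)
pattern = summary_pattern(sum_states, a_s)
```

Both attention vectors come from the same matrix `S`, which is the point of the shared similarity matrix: the column means weight summary words and the row means weight document words.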

Fact Extraction
In this section, we discuss how to extract useful facts from an input document with the help of prototype facts. We first extract the facts from the input document by calculating their relevance to the prototype facts. The similarity matrix $E$ is calculated between the weighted prototype document $a^d_i \hat{h}^x_i$ and the input document representation $h^x_j$: $E_{ij} = \alpha(a^d_i \hat{h}^x_i, h^x_j)$, where $\alpha$ is the similarity function introduced in Equation 4. Then, we sum $E_{ij}$ along the length of the prototype document to obtain the weight $E_j = \sum_{t=1}^{T_m} E_{tj}$ for the $j$-th word in the document. Next, similar to Equation 6, a convolutional layer is applied on the weighted hidden states $E_t h^x_t$ to obtain the fact representation $r_t$ from the input document: $r_t = \mathrm{CNN}(E_t h^x_t)$.
Inspired by the polishing strategy in extractive summarization (Chen et al., 2018), we propose to use the prototype facts to polish the extracted facts $r_t$ and obtain the final fact representation $m$, as shown in Figure 2. Generally, the polishing process consists of two hierarchical recurrent layers. The first recurrent layer is made up of Selective Recurrent Units (SRUs), which take the facts $r_{\cdot}$ and the polished fact $q^k$ as input and output the hidden state $h^k_{T_m}$. The second recurrent layer consists of regular Gated Recurrent Units (GRUs), which are used to update the polished fact from $q^k$ to $q^{k+1}$ using $h^k_{T_m}$. SRU is a modified version of the original GRU introduced in (Chen et al., 2018), details of which can be found in Appendix A. It differs from the GRU in that the update gate in the SRU is decided by both the polished fact $q^k$ and the original fact $r_t$. The $t$-th hidden state of the SRU, $h^k_t$, is calculated from $r_t$ and $q^k$, and we take $h^k_{T_m}$ as the overall representation of all input facts $r_{\cdot}$. In this way, the SRU can decide to what degree each unit should be updated based on its relationship with the polished fact $q^k$.
Next, $h^k_{T_m}$ is used to update the polished fact $q^k$ using the second recurrent layer, consisting of GRUs: $m^{k+1}, q^{k+1} = \mathrm{GRU}(h^k_{T_m}, q^k)$, where $q^k$ is the cell state, $h^k_{T_m}$ is the input and $m^{k+1}$ is the output hidden state. $q^0$ is initialized using $q$ in Equation 7. This iterative process is conducted $K$ times, and each output $m^k$ is stored as extracted facts $M = \{m^1, m^2, \dots, m^K\}$. In this way, $M$ stores facts at different levels of polishing.
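The control flow of the polishing process (an inner pass over all facts whose gate depends on both the fact and the polished fact, followed by an outer update, repeated K times) can be illustrated with a toy sketch. Everything below the level of that control flow is an assumption made for brevity: the gate is a sigmoid of a dot product, the candidate state has no weight matrices, and the outer GRU is replaced by a fixed blend. It is not the SRU or GRU parameterization of the paper.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def selective_pass(facts, q, h0):
    """Inner layer: visit every fact r_t; the gate depends on BOTH q and r_t."""
    h = list(h0)
    for r in facts:
        g = sigmoid(dot(q, r))  # assumed gate: relevance of r_t to q
        cand = [math.tanh(ri + hi) for ri, hi in zip(r, h)]  # assumed weight-free candidate
        h = [g * c + (1.0 - g) * hi for c, hi in zip(cand, h)]
    return h  # plays the role of h^k_{T_m}

def update_polished(q, h_last):
    """Outer layer stand-in for the GRU: blend the old q with the inner summary."""
    return [0.5 * qi + 0.5 * hi for qi, hi in zip(q, h_last)]

def polish(facts, q0, K):
    """Run K polishing hops and collect the outputs m^1 .. m^K."""
    q, M = list(q0), []
    for _ in range(K):
        h_last = selective_pass(facts, q, [0.0] * len(q0))
        q = update_polished(q, h_last)
        M.append(q)
    return M

facts = [[0.2, -0.1], [0.4, 0.3]]   # toy fact representations r_t
q0 = [0.5, -0.5]                    # toy initial polished fact
M = polish(facts, q0, K=3)
```

Each hop re-reads the same facts with an updated `q`, so later entries of `M` reflect facts selected under a more refined query.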

Editing Generator
The editing generator aims to generate a summary based on the input document, prototype summary and extracted facts. As in the prototype reader, we use LSTM as the RNN cell. We first apply a linear transformation to the summed summary pattern $\sum_{i=1}^{T_n} l_i$ and the input document representation $h^x_{T_m}$, and then employ the resulting vector as the initial state $d_0$ of the RNN generator, as shown in Equation 12. The procedure of the $t$-th generation step is shown in Equation 13. Here, $W_e$ and $b_e$ are trainable parameters, $d_t$ is the hidden state of the $t$-th generating step, and $g^i_{t-1}$ is the context vector produced by the standard attention mechanism (Bahdanau et al., 2015).
To take advantage of the extracted facts $M$ and the summary pattern $l$, we incorporate both into summary generation using dynamic attention. More specifically, we utilize a matching function $f$ to model the relationship between the current decoding state $d_t$ and each $v_i$ ($v_i$ can be an extracted fact $m_i$ or a summary pattern $l_i$), yielding a context vector $g^*_t$, where $g^*_t$ can be $g^m_t$ or $g^s_t$ for attending to the extracted facts or the summary pattern, respectively. We use a simple but efficient bi-linear layer as the matching function, $f = m_i W_f d_t$. To combine $g^m_t$ and $g^s_t$, we propose an "editing gate" $\gamma$, which is determined by the decoder state $d_t$ through a sigmoid function $\sigma$, to decide the importance of the summary pattern and the extracted facts at each decoding step.
Using the editing gate $\gamma$, we obtain $g^h_t$, which dynamically combines information from the extracted facts and the summary pattern. Finally, the context vector $g^h_t$ is concatenated with the decoder state $d_t$ and fed into a linear layer to obtain the generated word distribution $P_v$. The loss $L_s$ is the negative log-likelihood of the target word $y_t$: $L_s = -\sum_{t} \log P_v(y_t)$. In order to handle the out-of-vocabulary (OOV) problem, we equip our decoder with a pointer network (Gu et al., 2016; Vinyals et al., 2015; See et al., 2017). This process is the same as in the model described in (See et al., 2017) and is therefore omitted here due to limited space.
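A minimal sketch of the dynamic attention and editing gate is given below. The bilinear score follows the matching function $f = m_i W_f d_t$ from the text; the softmax normalization, the gate parameters `w_g` and `b_g`, and the convex combination in `combine` are assumptions, chosen so that a lower gate value puts more weight on the summary pattern, in line with the gate visualization discussed in the experiments.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def attend(d_t, items, W_f):
    """Bilinear scores f_i = v_i^T W_f d_t, softmax weights, weighted sum of the v_i."""
    Wd = matvec(W_f, d_t)
    scores = [sum(a * b for a, b in zip(v, Wd)) for v in items]
    top = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(items[0])
    return [sum(w * v[k] for w, v in zip(weights, items)) for k in range(dim)]

def editing_gate(d_t, w_g, b_g):
    """gamma = sigmoid(w_g . d_t + b_g); w_g and b_g are hypothetical parameters."""
    return sigmoid(sum(w * x for w, x in zip(w_g, d_t)) + b_g)

def combine(g_m, g_s, gamma):
    """Assumed convex combination: gamma weights facts, (1 - gamma) the pattern."""
    return [gamma * m + (1.0 - gamma) * s for m, s in zip(g_m, g_s)]

d_t = [1.0, 0.0]                         # toy decoder state
facts = [[1.0, 0.0], [0.0, 1.0]]         # toy extracted facts m_i
pattern = [[0.5, 0.5], [0.0, 1.0]]       # toy summary pattern l_i
identity = [[1.0, 0.0], [0.0, 1.0]]      # toy W_f

g_m = attend(d_t, facts, identity)
g_s = attend(d_t, pattern, identity)
gamma = editing_gate(d_t, [0.0, 0.0], 0.0)
g_h = combine(g_m, g_s, gamma)
```

The same `attend` routine serves both memories; only the gate decides, per decoding step, how much of each context vector reaches the output layer.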
Moreover, previous work (Holtzman et al., 2018) has found that using a cross entropy loss alone is not enough for generating coherent text. Similarly, in our task, using $L_s$ alone is not enough to distinguish a good summary with accurate facts from a bad summary with detailed facts from the prototype document (see § 6.2). Thus, we propose a fact checker to determine whether the generated summary is highly related to the input document.

Fact Checker
To generate accurate summaries that are consistent with the detailed facts from the input document rather than facts from the prototype document, we add a fact checker to provide additional training signals for the generator. Following (Hjelm et al., 2018), we employ the neural mutual information estimator to estimate the mutual information between the generated summary $Y$ and its corresponding document $X$, as well as the prototype document $\hat{X}$. Generally, mutual information is estimated at a local and a global level, and we expect the matching degree to be higher between the generated summary and the input document than between the generated summary and the prototype document. An overview of the fact checker is shown in Figure 3.
To begin, we use a local matching network to calculate the matching degree, for local features, between the generated summary and the input document, as well as the prototype document. Recall that fact extraction yields the fact representations $r_t$ of the input document and $\hat{r}_t$ of the prototype document. Concatenating each of them with the final decoder state $d_{T_n}$ gives the local features; for the prototype document, $C_f = \{d_{T_n} \oplus \hat{r}_1, \dots, d_{T_n} \oplus \hat{r}_{T_m}\}$, and the features for the input document are built in the same way from $r_t$.
A $1 \times 1$ convolutional layer and a fully-connected layer are applied to score these two features, where $\tau^r_l \in \mathbb{R}$ and $\tau^f_l \in \mathbb{R}$ represent the local matching degree between the generated summary and the input document, and the prototype document, respectively. We want the generated summary to be more similar to the input document than to the prototype document. Thus, the optimization objective of the local matching network is to minimize the local loss $L_l$.
We also have a global matching network to measure the matching degree, for global features, between the generated summary and the input document, as well as the prototype document. To do so, we concatenate the representation of the generated summary with the final hidden state of the input document $h^x_{T_m}$ and the final state of the prototype document $\hat{h}^x_{T_m}$, respectively, and apply a linear layer to each, where $W_m$, $b_m$ are trainable parameters and $\tau^r_g \in \mathbb{R}$ and $\tau^f_g \in \mathbb{R}$ represent the matching degree between the generated summary and the input document, and the prototype document, respectively. The objective of this global matching network, similar to that of the local matching network, is to minimize an analogous global loss.
Finally, we combine the local and global loss functions to obtain the final loss $L$ (Equation 28), which we use to calculate the gradients for all parameters. The two weights in this combination, one of which is $\eta$, are hyper-parameters. To optimize the trainable parameters, we employ the gradient descent method Adagrad (Duchi et al., 2010).
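One way to turn a pair of matching scores into a training signal is sketched below. The softplus form is borrowed from the Jensen-Shannon style estimator used in Deep InfoMax and is an assumption here, as are the names `L_g` and `w_local` and the assignment of `eta` to the global term; the text only states that the input document should score higher than the prototype document and that the final loss combines the terms with two hyper-parameters.

```python
import math

def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def matching_loss(tau_r, tau_f):
    """Small when the input-document score tau_r is high and the prototype score tau_f is low."""
    return softplus(-tau_r) + softplus(tau_f)

def total_loss(L_s, L_l, L_g, w_local=1.0, eta=1.0):
    """Assumed combination: generation loss plus weighted local and global terms."""
    return L_s + w_local * L_l + eta * L_g

good = matching_loss(tau_r=5.0, tau_f=-5.0)   # summary matches the input document
bad = matching_loss(tau_r=-5.0, tau_f=5.0)    # summary matches the prototype instead
```

With both weights set to 1.0, as in the implementation details, the three terms contribute equally to the gradient.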

Dataset
We collect a large-scale prototype-based summarization dataset, which contains 2,003,390 court judgment documents. In this dataset, we use a case description as the input document and the court judgment as the summary. The average lengths of the input documents and summaries are 595.15 words and 273.57 words, respectively. The percentage of words common to a prototype summary and the reference summary is 80.66%, which confirms the feasibility and necessity of prototype summarization. Following other summarization datasets (Grusky et al., 2018; Kim et al., 2019; Narayan et al., 2018a), we also count the novel n-grams in a summary compared with the n-grams in the original document; the percentages of novel n-grams are 51.21%, 84.59%, 91.48% and 94.83% for 1-grams to 4-grams, respectively. Coverage, compression and density (Grusky et al., 2018) are commonly used as metrics to evaluate the abstractness of a summary. For the summaries in our dataset, the coverage is 48.78%, the compression is 2.28 and the density is 1.31. We anonymize entity tokens into special tags, such as using "PERS" to replace a person's name.
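The novel n-gram statistic can be computed as in the following sketch. It counts n-gram occurrences in the summary (tokens rather than distinct types), which is an assumption, since the counting convention is not specified above; the function names and the toy sentences are illustrative only.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def novel_ngram_ratio(document, summary, n):
    """Fraction of summary n-grams that never occur in the document."""
    doc_set = set(ngrams(document, n))
    summ = ngrams(summary, n)
    if not summ:
        return 0.0
    return sum(1 for g in summ if g not in doc_set) / len(summ)

document = "the court held that PERS stole property".split()
summary = "the court held that PERS committed theft".split()

unigram_novelty = novel_ngram_ratio(document, summary, 1)
bigram_novelty = novel_ngram_ratio(document, summary, 2)
```

Averaging this ratio over all document-summary pairs, for n from 1 to 4, gives figures comparable in kind to the percentages reported above.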

Comparisons
In order to verify the effectiveness of each module of PESG, we conduct several ablation studies, shown in Table 2. We also compare our model with the following baselines:
(1) Lead-3 is a commonly used summarization baseline (Nallapati et al., 2017; See et al., 2017), which selects the first three sentences of the document as the summary.
(2) S2S is a sequence-to-sequence framework with a pointer network, proposed by See et al. (2017).
(3) Proto is a context-aware prototype editing model for dialog response generation.
(4) Re3Sum, proposed by Cao et al. (2018), uses an IR platform to retrieve proper summaries and extends the seq2seq framework to jointly conduct template-aware summary generation.
(5) Uni-model was proposed by Hsu et al. (2018), and is the current state-of-the-art abstractive summarization approach on the CNN/DailyMail dataset.
(6) We also directly concatenate the prototype summary with the original document as input for S2S and Uni-model, named Concat-S2S and Concat-Uni, respectively.

Evaluation Metrics
For the court judgment dataset, we evaluate standard ROUGE-1, ROUGE-2 and ROUGE-L (Lin, 2004) on full-length F1 following previous works (Nallapati et al., 2017; See et al., 2017; Paulus et al., 2018), where ROUGE-1 (R1), ROUGE-2 (R2), and ROUGE-L (RL) refer to the matches of unigrams, bigrams, and the longest common subsequence, respectively. Schluter (2017) notes that only using the ROUGE metric to evaluate summarization quality can be misleading. Therefore, we also evaluate our model by human evaluation. Three highly educated participants are asked to score 100 randomly sampled summaries generated by three models: Uni-model, Re3Sum and PESG. The statistical significance of observed differences between the performance of two runs is tested using a two-tailed paired t-test, with strong and weak significance marked at α = 0.01.
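For intuition, the three F1 scores can be sketched as below. This is a simplified illustration, not the official ROUGE toolkit: it applies no stemming or stopword handling, and ROUGE-L is computed from a single longest common subsequence over the whole token sequence rather than sentence by sentence. The function names and example sentences are hypothetical.

```python
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def f1(overlap, cand_total, ref_total):
    """Harmonic mean of precision and recall; 0.0 when nothing matches."""
    if overlap == 0 or cand_total == 0 or ref_total == 0:
        return 0.0
    p, r = overlap / cand_total, overlap / ref_total
    return 2 * p * r / (p + r)

def rouge_n_f1(candidate, reference, n):
    cand, ref = ngram_counts(candidate, n), ngram_counts(reference, n)
    overlap = sum((cand & ref).values())  # clipped n-gram matches
    return f1(overlap, sum(cand.values()), sum(ref.values()))

def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def rouge_l_f1(candidate, reference):
    return f1(lcs_length(candidate, reference), len(candidate), len(reference))

candidate = "the court held that PERS committed theft".split()
reference = "the court held PERS committed the crime of theft".split()

r1 = rouge_n_f1(candidate, reference, 1)
r2 = rouge_n_f1(candidate, reference, 2)
rl = rouge_l_f1(candidate, reference)
```

Published results should be computed with a standard ROUGE implementation so that scores are comparable across papers.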

Implementation Details
We implement our experiments in TensorFlow (Abadi et al., 2016) on an NVIDIA GTX 1080 Ti GPU. The word embedding dimension is 256 and the number of hidden units is 256. The batch size is set to 64. We pad or truncate each input document to exactly 250 words, and the decoding length is set to 100. Both loss weights in Equation 28 (η and its counterpart) are set to 1.0. We initialize all of the parameters randomly using a Gaussian distribution. We use the Adagrad optimizer (Duchi et al., 2010) as our optimizing algorithm and employ beam search with size 5 to generate more fluent summaries. We also apply gradient clipping (Pascanu et al., 2013) with range [−5, 5] during training. We use dropout (Srivastava et al., 2014) as regularization with keep probability p = 0.7.
6 Experimental Result

Overall Performance
We compare our model with the baselines listed in Table 3. Our model performs consistently better than other summarization models, including the state-of-the-art model, with improvements of 6%, 12% and 6% in terms of ROUGE-1, ROUGE-2 and ROUGE-L. This demonstrates that the prototype document-summary pair provides strong guidance for summary generation that cannot be replaced by more complicated baselines without prototype information. Meanwhile, directly concatenating the prototype summary with the original input does not increase performance; instead it leads to drops of 9%, 17%, 8% and 1%, 3%, 2% in terms of ROUGE-1, ROUGE-2 and ROUGE-L on S2S and Uni-model, respectively. As for the baseline model Proto, we found that it directly copies the prototype summary as the generated summary, which yields summaries that are incorrect for the input document. For the human evaluation, we asked annotators to rate each summary according to its consistency and fluency. The rating score ranges from 1 to 3, with 3 being the best. Table 4 lists the average scores of each model, showing that PESG outperforms the other baseline models in both fluency and consistency. The kappa statistics are 0.33 and 0.29 for fluency and consistency, respectively, which indicates moderate agreement between annotators. To test the significance of these results, we also conduct a paired Student's t-test between our model and Re3Sum (row with shaded background). We obtain p-values of $2 \times 10^{-7}$ and $9 \times 10^{-12}$ for fluency and consistency, respectively.
We also analyze how the two loss-weight hyper-parameters (η and its counterpart) affect performance. Our model performs consistently well across the settings we tried, with ROUGE-1, ROUGE-2 and ROUGE-L scores above 39.5, 27.5 and 39.4, which suggests that it is robust to the choice of these hyper-parameters.

Ablation Study
The ROUGE scores of different ablation models are shown in Table 5. All ablation models perform worse than PESG in terms of all metrics, which shows that every component contributes to the full model. More importantly, through this controlled experiment, we can verify the contribution of each module in PESG.

Analysis of Editing Generator
We visualize the editing gate weights in Figure 4. A lower weight (lighter color) means that the word is more likely to be copied from the summary pattern; that is to say, this word is a universal patternized word. We can see that the phrase 本院认为 (the court held that) has a lower weight than the name of the defendant (PERS), which is consistent with the fact that "the court held that" is a patternized phrase whereas the name of the defendant is closely related to the input document.
We also show a case study in Table 6, which includes the input document and the reference summary together with the generated summaries. Underlined text denotes a grammar error and a strike-through line denotes a fact contrary to the input document. We only show part of the document and summary due to limited space; the full version is shown in the Appendix. As can be seen, the summary generated by Uni-model suffers from an inconsistency problem and the summary generated by Re3Sum is contrary to the facts described in the input document. In contrast, PESG overcomes both of these problems and generates an accurate summary with good grammar and logic.

Analysis of Fact Extraction Module
We investigate the influence of the iteration number when facts are extracted. Figure 5 illustrates the relationship between the iteration number and the F-value of the ROUGE score. The results show that the ROUGE scores first increase with the number of hops and, after reaching an upper limit, begin to drop. This indicates that polishing the fact representation makes the fact extraction module more effective, up to a point.

Conclusion
In this paper, we propose a framework named Prototype Editing based Summary Generator (PESG), which aims to generate summaries in formal writing scenarios, where summaries should conform to a patternized style. Given a prototype document-summary pair, our model first calculates the cross dependency between the prototype document and its summary. Next, a fact extraction module is employed to extract facts from the input document, which are then polished. Finally, we design an editing-based generator to produce a summary by incorporating the polished facts and the summary pattern. To ensure that the generated summary is consistent with the input document, we propose a fact checker to estimate the mutual information between the input document and the generated summary. Our model outperforms state-of-the-art methods in terms of ROUGE scores and human evaluations by a large margin, which demonstrates the effectiveness of PESG.

A SRU Cell
The gated recurrent unit (GRU) (Cho et al., 2014) is a gating mechanism in recurrent neural networks, which incorporates an update gate into an RNN. We first give the details of the original GRU here.
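For reference, the standard GRU of Cho et al. (2014), written with the parameter names used in this appendix, takes the following form. The ordering of the four equations (update gate first, new hidden state last) is inferred from the references to Equation 29 and Equation 32, and the reset gate is written $r_i$ on the assumption that it pairs with $W^{(r)}$ and $U^{(r)}$; it should not be confused with the fact representation $r_t$ in the main text.

```latex
\begin{aligned}
u_i &= \sigma\big(W^{(u)} x_i + U^{(u)} h_{i-1}\big),\\
r_i &= \sigma\big(W^{(r)} x_i + U^{(r)} h_{i-1}\big),\\
\tilde{h}_i &= \tanh\big(W^{(h)} x_i + U (r_i \odot h_{i-1})\big),\\
h_i &= (1 - u_i) \odot \tilde{h}_i + u_i \odot h_{i-1}.
\end{aligned}
```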
Here $\sigma$ is the sigmoid activation function, $W^{(u)}, W^{(r)}, W^{(h)} \in \mathbb{R}^{n_H \times n_I}$, $U^{(u)}, U^{(r)}, U \in \mathbb{R}^{n_H \times n_H}$, $n_H$ is the hidden size, and $n_I$ is the size of the input $x_i$. In the original version of the GRU, the update gate $u_i$ in Equation 29 is used to decide how much of the hidden state should be retained and how much should be updated. In our case, we want to decide which facts are salient according to the polished facts $q^{k-1}$ at the $k$-th hop. To achieve this, we replace $u_i$ with a newly computed update gate $g_i$, where $W^{(2)}, W^{(1)}, b^{(1)}, b^{(2)}$ are all trainable parameters and $k$ is the hop number in the multi-hop setting, which is a manually set hyper-parameter. The effect of this hyper-parameter is verified in the experimental results shown in § 6.4. Equation 32 is then computed with $g_i$ in place of $u_i$. We use the name "SRU" to denote this modified version of a GRU cell.

B Case Study
Table 7: Examples of the generated natural answers by PESG and other models.
决如下被告人PERS 犯服刑判处有期徒刑MONTHS并处罚金 人民币MONEY (The court held that NUM PERS was driving a motor vehicle on the road, and his behavior constituted that the appellant was guilty of serving a sentence. The criminal facts accused by the public prosecution agency were clear, the evidence was indeed sufficient, and the charges were supported by the court. After confessing his crimes, he should be given a lighter punishment according to law. The public prosecution agency's sentencing recommendations were appropriate and adopted by the court. According to the provisions of Section NUM of the Criminal Law of the People's Republic of China, the judgment was as follows, the defendant PERS was sentenced to imprisonment and sentenced to fixed-term MONTHS imprisonment, and fined the penalty RMB MONEY.)
(The court believed that the criminal PERS did have repentance during his sentence. In accordance with the statutory commutation conditions and NUM and NUM of the Criminal Law of the People's Republic of China, the ruling was as follows: exempted PERS from the MONTHS penalty. Legal effect would occur upon the delivery of this ruling.)