An Editorial Network for Enhanced Document Summarization

We suggest a new idea of Editorial Network – a mixed extractive-abstractive summarization approach, which is applied as a post-processing step over a given sequence of extracted sentences. We further suggest an effective way for training the “editor” based on a novel soft-labeling approach. Using the CNN/DailyMail dataset we demonstrate the effectiveness of our approach compared to state-of-the-art extractive-only or abstractive-only baselines.


Introduction
Automatic text summarizers condense a given piece of text into a shorter version (the summary). This is done while trying to preserve the main essence of the original text and keeping the generated summary as readable as possible.
Existing summarization methods can be classified into two main types, either extractive or abstractive [8]. Extractive methods select and order text fragments (e.g., sentences) from the original text source [2,5,6,7,18,29]. Such methods are relatively simple to develop and keep the extracted fragments untouched, which preserves important parts such as keyphrases, facts and opinions. Yet, extractive summaries tend to be less fluent, coherent and readable, and may include superfluous text.
Abstractive methods apply natural language paraphrasing and/or compression on a given text. A common approach is based on the encoder-decoder (seq-to-seq) paradigm [24], with the original text sequence being encoded while the summary is the decoded sequence. While such methods usually generate summaries with better readability, their quality declines over longer textual inputs, which may lead to higher redundancy [22]. Moreover, such methods are sensitive to vocabulary size, making them more difficult to train and generalize [23].
A common approach for handling long text sequences in abstractive settings is through attention mechanisms, which aim to imitate the attentive reading behaviour of humans [3]. Two main types of attention methods may be utilized, either soft or hard. Soft attention methods first locate salient text regions within the input text and then bias the abstraction process to prefer such regions during decoding [4,9,12,19,15,21,25]. On the other hand, hard attention methods perform abstraction only on text regions that were initially selected by some extraction process [1,18,17].
Compared to previous works, whose final summary is either entirely extracted or generated using an abstractive process, in this work we suggest a new idea of an "Editorial Network" (EditNet): a mixed extractive-abstractive summarization approach. A summary generated by EditNet may include extracted sentences, abstracted sentences, or both. Moreover, for each considered sentence, EditNet may take neither of these decisions and instead reject the sentence completely. Using the CNN/DailyMail dataset, we demonstrate that EditNet's summarization quality exceeds that of state-of-the-art abstractive-only baselines. EditNet's summarization quality is also shown to be highly competitive with that of NeuSum [30], which is, to the best of our knowledge, the best performing extractive-only baseline. Yet, while EditNet obtains a similar summarization quality to that of NeuSum, which applies only extraction, EditNet (on average) applies abstraction to the majority of each summary's extracted sentences.

Editorial Network
Figure 1 depicts the architecture of our proposed Editorial Network-based approach. We apply this approach as a post-processing step over a given summary whose sentences were selected by some extractor. The key idea is to imitate the decision process of a human editor who needs to edit the summary so as to enhance its quality. Let S denote a summary which was extracted from a given text (document) D. The editorial process is implemented by iterating over the sentences in S according to the selection order of the extractor. For each sentence in S, the "editor" may make one of three possible decisions. The first decision is to keep the extracted sentence untouched (represented by label E in Figure 1). The second alternative is to rephrase the sentence (represented by label A in Figure 1). Such a decision may, for example, represent the editor's wish to simplify or compress the original source sentence. The last possible decision is to completely reject the sentence (represented by label R in Figure 1). For example, the editor may wish to ignore superfluous or duplicate information expressed in the current sentence. An example mixed summary generated by our approach is depicted in Figure 2, further emphasizing the editor's various decisions.

Implementing the editor's decisions
For a given sentence $s \in D$, we now denote by $s^e$ and $s^a$ its original (extracted) and paraphrased (abstracted) versions.
To obtain $s^a$ we use an abstractor, whose details will be explained shortly (see Section 2.2). Let $e_s \in \mathbb{R}^n$ and $a_s \in \mathbb{R}^n$ further denote the corresponding sentence representations of $s^e$ and $s^a$, respectively. Such representations allow both sentence versions to be compared on the same grounds.
Recall that, for each sentence $s_i \in S$ (in order), the editor makes one of the three possible decisions: extract, abstract or reject $s_i$. Therefore, the editor may modify summary $S$ by paraphrasing or rejecting some of its sentences, resulting in a mixed extractive-abstractive summary $S'$.
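The effect of a decision sequence on the summary can be illustrated with a minimal sketch. It assumes the decisions are already given as labels "E", "A" and "R"; the function name `apply_decisions` and its arguments are hypothetical and only serve the illustration.

```python
from typing import List, Sequence


def apply_decisions(extracted: Sequence[str],
                    abstracted: Sequence[str],
                    decisions: Sequence[str]) -> List[str]:
    """Build the mixed summary S' from S, given one decision per sentence.

    extracted[i] is s^e_i, abstracted[i] is s^a_i, and decisions[i] is one of
    "E" (keep the extracted version), "A" (use the abstracted version) or
    "R" (reject the sentence).
    """
    if not (len(extracted) == len(abstracted) == len(decisions)):
        raise ValueError("all inputs must have the same length")

    mixed_summary: List[str] = []
    for s_e, s_a, decision in zip(extracted, abstracted, decisions):
        if decision == "E":
            mixed_summary.append(s_e)
        elif decision == "A":
            mixed_summary.append(s_a)
        elif decision != "R":
            raise ValueError(f"unknown decision: {decision!r}")
        # "R": the sentence is dropped, nothing is appended.
    return mixed_summary
```

Sentences keep the extractor's selection order; a rejected sentence simply leaves no trace in $S'$.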
Example (Figure 2). Editor's automatic summary:
E: what was supposed to be a fantasy sports car ride at walt disney world speedway turned deadly when a lamborghini crashed into a guardrail.
A: the crash took place sunday at the exotic driving experience (a).
A: the lamborghini 's passenger , gary terry , died at the scene (b).
R: petty holdings , which operates the exotic driving experience at walt disney world speedway , released a statement sunday night about the crash.
(a) Original extracted sentence: "the crash took place sunday at the exotic driving experience , which bills itself as a chance to drive your dream car on a racetrack".
(b) Original extracted sentence: "the lamborghini 's passenger , 36-year-old gary terry of davenport , florida , died at the scene , florida highway patrol said".
Ground truth summary: the crash occurred at the exotic driving experience at walt disney world speedway. officials say the driver , 24-year-old tavon watson , lost control of a lamborghini. passenger gary terry , 36 , died at the scene.

Following [1,27], $d$ is then calculated as follows: [equation missing]. The second auxiliary representation is that of the summary generated by the editor so far, denoted at step $i$ as $g_{i-1} \in \mathbb{R}^n$, with $g_0 = 0$. Such a representation provides a local context for decision making. Given the four representations as input, the editor's decision for sentence $s_i \in S$ is implemented using two fully-connected layers, as follows: [equation missing]. In each step $i$, therefore, the editor chooses the action $\pi_i \in \{E, A, R\}$ with the highest likelihood (according to Eq. 2), further denoted $p(\pi_i)$. Upon decision, in case it is either E or A, the editor appends the corresponding sentence version (i.e., either $s^e_i$ or $s^a_i$) to $S'$; otherwise, the decision is R and sentence $s_i$ is discarded. Depending on its decision, the current summary representation is further updated as follows: [equation missing], where $W_g \in \mathbb{R}^{n \times n}$ are learnable parameters, $g_{i-1}$ is the summary representation from the previous decision step, and $h_i \in \{e_{s_i}, a_{s_i}, 0\}$, depending on which decision is made.
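A single decision step can be sketched as follows. The exact equations are not reproduced in this text, so the forms used here are assumptions: the four representations are concatenated, passed through a tanh hidden layer and a linear output layer with a softmax over {E, A, R}, and the summary state is updated additively as $g_i = g_{i-1} + \tanh(W_g h_i)$. All names (`editor_step`, `W_c`, `b_c`, `V`, `b`) are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]
Matrix = List[List[float]]
ACTIONS = ("E", "A", "R")


def matvec(m: Matrix, v: Sequence[float]) -> Vector:
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def softmax(z: Sequence[float]) -> Vector:
    """Numerically stable softmax."""
    top = max(z)
    exps = [math.exp(x - top) for x in z]
    total = sum(exps)
    return [x / total for x in exps]


def editor_step(e: Vector, a: Vector, g_prev: Vector, d: Vector,
                W_c: Matrix, b_c: Vector, V: Matrix, b: Vector,
                W_g: Matrix) -> Tuple[str, Vector, Vector]:
    """One editor decision for a sentence (assumed form, see lead-in).

    Returns the chosen action, the distribution over (E, A, R) and the
    updated summary representation g_i.
    """
    # Assumed input layout: concatenation [e; a; g_{i-1}; d].
    x = list(e) + list(a) + list(g_prev) + list(d)
    # First fully-connected layer with tanh, second one producing 3 logits.
    hidden = [math.tanh(u + bias) for u, bias in zip(matvec(W_c, x), b_c)]
    logits = [u + bias for u, bias in zip(matvec(V, hidden), b)]
    probs = softmax(logits)
    action = ACTIONS[max(range(len(ACTIONS)), key=probs.__getitem__)]

    # h_i is e, a, or the zero vector, depending on the decision.
    h = {"E": e, "A": a, "R": [0.0] * len(e)}[action]
    # Assumed additive update of the summary representation.
    g_next = [g + math.tanh(u) for g, u in zip(g_prev, matvec(W_g, h))]
    return action, probs, g_next
```

Under this assumed update, a rejected sentence leaves the summary representation unchanged, since $h_i = 0$.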
Such a network architecture can capture various complex interactions between the different inputs. For example, the network may learn that, given the global context, one of the sentence versions produces a summary with better coverage. As another example, based on the interaction between both sentence versions and either the local or the global context (and possibly between the two contexts), the network may learn that both sentence versions would only add superfluous or redundant information to the summary, and therefore decide to reject both.

Extractor and Abstractor
As a proof of concept, in this work, we utilize the extractor and abstractor that were previously used in [1], with a slight modification to the latter, motivated by its specific usage within our approach. We now only highlight important aspects of these two sub-components and kindly refer the reader to [1] for the full implementation details.
The extractor of [1] consists of two main sub-components. The first is an encoder which encodes each sentence $s \in D$ into $e_s$ using a hierarchical representation. The second is a sentence selector using a Pointer-Network [26]. For the latter, let $P(s)$ be the selection likelihood of sentence $s$.
The abstractor of [1] is basically a standard encoder-aligner-decoder with a copy mechanism [23]. Yet, instead of applying it directly only on a single given extracted sentence $s^e_i \in S$, we apply it on a "chunk" of three consecutive sentences $(s^e_-, s^e_i, s^e_+)$, where $s^e_-$ and $s^e_+$ denote the sentences that precede and succeed $s^e_i$ in $D$, respectively. This, in turn, allows generating an abstractive version of $s^e_i$ (i.e., $s^a_i$) that benefits from a wider local context. Inspired by previous soft-attention methods, we further utilize the extractor's sentence selection likelihoods $P(\cdot)$ for enhancing the abstractor's attention mechanism, as follows. Let $C(w_j)$ denote the abstractor's original attention value of a given word $w_j$ occurring in $(s^e_-, s^e_i, s^e_+)$; we then recalculate this value to be

$$C'(w_j) = \frac{C(w_j) \cdot P(s)}{Z}, \quad w_j \in s,\ s \in \{s^e_-, s^e_i, s^e_+\},$$

where $Z = \sum_{s' \in \{s^e_-, s^e_i, s^e_+\}} \sum_{w_j \in s'} C(w_j) \cdot P(s')$ denotes the normalization term.
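The attention re-weighting above can be sketched in a few lines. This is only an illustration of the formula for $C'(w_j)$; the function name `reweight_attention` and the dictionary layout (one list of word-level attention values per sentence of the chunk) are assumptions.

```python
from typing import Dict, List


def reweight_attention(attention: Dict[str, List[float]],
                       selection_prob: Dict[str, float]) -> Dict[str, List[float]]:
    """Rescale word attention C(w_j) by the sentence selection likelihood P(s).

    attention maps a sentence key (e.g. "prev", "cur", "next") to the attention
    values of its words; selection_prob maps the same keys to P(s).
    Returns C'(w_j) = C(w_j) * P(s) / Z for every word of every sentence.
    """
    # Z sums C(w_j) * P(s') over all words of all sentences in the chunk.
    z = sum(c * selection_prob[s]
            for s, weights in attention.items()
            for c in weights)
    if z == 0:
        raise ValueError("normalization term Z is zero")
    return {s: [c * selection_prob[s] / z for c in weights]
            for s, weights in attention.items()}
```

Because of the normalization by $Z$, the re-weighted values sum to one over the whole chunk, and words in sentences with a high selection likelihood gain attention mass at the expense of the others.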

Sentence representation
Recall that, in order to compare $s^e_i$ with $s^a_i$, we need to represent both sentence versions on grounds that are as similar as possible. To achieve that, we first replace $s^e_i$ with $s^a_i$ within the original document $D$. By doing so, we basically treat sentence $s^a_i$ as if it were an ordinary sentence within $D$, while the rest of the document remains untouched. We then obtain $s^a_i$'s representation by encoding it using the extractor's encoder, in the same way in which sentence $s^e_i$ was originally encoded. This results in a representation $a_{s_i}$ that provides a comparable alternative to $e_{s_i}$, whose encoding is expected to be affected by similar contextual grounds.
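The substitution step can be sketched as below. The real encoder is the hierarchical encoder of [1]; here `toy_encoder` is a purely hypothetical stand-in (it only measures sentence length and mean document length) so that the example runs, and `represent_abstracted` is an illustrative name.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Encoder = Callable[[Sequence[str]], List[Vector]]


def represent_abstracted(document: Sequence[str], index: int,
                         abstracted: str, encode_document: Encoder) -> Vector:
    """Encode s^a_i in place of s^e_i, keeping the rest of D untouched."""
    edited = list(document)      # copy, so the original document is not modified
    edited[index] = abstracted   # s^a_i takes the position of s^e_i
    return encode_document(edited)[index]


def toy_encoder(sentences: Sequence[str]) -> List[Vector]:
    """Hypothetical context-dependent encoder used only for this sketch.

    Each sentence is mapped to [its token count, mean token count of the
    document], so the vector depends on both the sentence and its context.
    """
    lengths = [len(s.split()) for s in sentences]
    mean_length = sum(lengths) / len(lengths)
    return [[float(n), mean_length] for n in lengths]
```

The point of the design is that both versions of a sentence are encoded at the same position and with the same surrounding sentences, so differences between $e_{s_i}$ and $a_{s_i}$ stem from the sentence itself rather than from its context.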

Network training
We conclude this section with the description of how we train the editor using a novel soft labeling approach. Given text $S$ (with $l$ extracted sentences), let $\pi = (\pi_1, \ldots, \pi_l)$ denote its editing decisions sequence. We define the following "soft" cross-entropy loss: [equation missing], where, for a given sentence $s_i \in S$, $y(\pi_i)$ denotes its soft label for decision $\pi_i$.
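A cross-entropy with soft labels can be sketched as follows. Since the loss equation is not reproduced in this text, the form below is an assumption: the generic soft cross-entropy $-\frac{1}{l}\sum_{i=1}^{l}\sum_{\pi \in \{E,A,R\}} y_i(\pi) \log p_i(\pi)$. How the soft labels themselves are derived is not covered here, and the name `soft_cross_entropy` is hypothetical.

```python
import math
from typing import Sequence


def soft_cross_entropy(probs: Sequence[Sequence[float]],
                       soft_labels: Sequence[Sequence[float]],
                       eps: float = 1e-12) -> float:
    """Average soft cross-entropy over the l sentences of a summary.

    probs[i] is the editor's distribution over (E, A, R) for sentence i and
    soft_labels[i] holds the corresponding soft labels y.
    """
    if len(probs) != len(soft_labels) or not probs:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for p_i, y_i in zip(probs, soft_labels):
        # eps guards against log(0) for actions with zero predicted likelihood.
        total -= sum(y * math.log(max(p, eps)) for p, y in zip(p_i, y_i))
    return total / len(probs)
```

With one-hot labels this reduces to the usual cross-entropy; soft labels let several decisions share credit for the same sentence.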

Dataset and Setup
We trained, validated and tested our approach using the non-anonymized version of the CNN/DailyMail dataset [11]. Following [19], we used the story highlights associated with each article as its ground truth summary. We further used the F-measure versions of ROUGE-1 (R-1), ROUGE-2 (R-2) and ROUGE-L (R-L) as our evaluation metrics [16].
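For intuition, the F-measure variant of ROUGE-N can be sketched as clipped n-gram overlap between a candidate and a reference summary. This is a simplified illustration, not the official ROUGE toolkit [16]: it assumes whitespace tokenization and lowercasing, with no stemming or stopword handling, and the name `rouge_n_f` is hypothetical.

```python
from collections import Counter


def rouge_n_f(candidate: str, reference: str, n: int = 1) -> float:
    """Simplified ROUGE-N F-measure based on clipped n-gram overlap."""

    def ngrams(text: str) -> Counter:
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n])
                       for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

For example, scoring the candidate "the cat sat" against the reference "the cat sat on the mat" with n = 1 gives a precision of 1, a recall of 0.5 and hence an F-measure of 2/3.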

Results
Table 1 compares the quality of EditNet with that of several state-of-the-art extractive-only or abstractive-only baselines. This includes the extractor (rnn-ext-RL) and abstractor (rnn-ext-abs-RL) components of [1] that we further utilized for implementing EditNet (see again Section 2.2).
Overall, EditNet provides a highly competitive summary quality, where it outperforms all baselines in the R-2 and R-L metrics. On R-1, EditNet outperforms all abstractive baselines and almost all extractive ones. Interestingly, EditNet's summarization quality is quite similar to that of NeuSum [30]. Yet, while NeuSum applies an extraction-only approach, summaries generated by EditNet include a mixture of sentences that have been either extracted or abstracted.
On average, 56% and 18% of EditNet's decisions were to abstract (A) or reject (R), respectively. Moreover, on average, per summary, EditNet keeps only 33% of the original (extracted) sentences, while the rest (67%) are abstracted ones. This demonstrates that EditNet has a high capability of utilizing abstraction, while also being able to keep or reject the original extracted text whenever doing so is estimated to best benefit the summary's quality.

Conclusion and Future Work
We have shown that instead of solely applying extraction or abstraction, a better choice would be a mixed one. As future work, we plan to evaluate other alternative extractor+abstractor configurations and try to train the network end-to-end. We further plan to explore reinforcement learning (RL) as an alternative decision making approach.
Figure 1: Editorial Network

Figure 2: An example mixed summary (annotated with the editor's decisions) taken from the CNN/DM dataset