Towards Annotating and Creating Sub-Sentence Summary Highlights

Highlighting is a powerful tool for picking out and emphasizing important content. Creating summary highlights at the sub-sentence level is particularly desirable, because sub-sentences are more concise than whole sentences. They are also better suited than individual words and phrases, which can lead to disfluent, fragmented summaries. In this paper we seek to generate summary highlights by annotating summary-worthy sub-sentences and teaching classifiers to do the same. We frame the task as jointly selecting important sentences and identifying a single most informative textual unit from each sentence. This formulation dramatically reduces the task complexity involved in sentence compression. Our study provides new benchmarks and baselines for generating highlights at the sub-sentence level.


Introduction
Highlighting at an appropriate level of granularity is important for emphasizing salient content in an unobtrusive manner. A small collection of keywords may be insufficient to deliver the main points of an article, while highlighting whole sentences often provides superfluous information. In domains such as newswire, scholarly publications, and legal and policy documents (Kim et al., 2010; Sadeh et al., 2013; Hasan and Ng, 2014), writers tend to produce long, complicated sentences. It is thus particularly desirable to pick out only the important sentence parts rather than whole sentences.
Existing compressive methods select representative sentences from source documents, then delete nonessential words and constituents to form compressed summaries. Nonetheless, making multiple interdependent decisions on word deletion can render summaries ungrammatical and fragmented. In this paper, we investigate an alternative formulation that dramatically reduces the task complexity involved in sentence compression.
We frame the task as jointly selecting representative sentences from a document and identifying a single most informative textual unit from each sentence to create sub-sentence highlights. This formulation is inspired by rhetorical structure theory (RST; Mann and Thompson, 1988), where sub-sentence highlights resemble the nuclei, i.e., text spans essential to expressing the writer's purpose. The formulation also mimics human behavior in picking out important content. If multiple parts of a sentence are important, a human uses a single stroke to highlight them all, up to the whole sentence. If only part of a sentence is relevant, she picks out that particular part alone.
Generating sub-sentence highlights is advantageous over abstraction (See et al., 2017; Chen and Bansal, 2018; Gehrmann et al., 2018; Lebanoff et al., 2018; Celikyilmaz et al., 2018) in several aspects. The highlights can be overlaid on the source document, allowing them to be interpreted in context. The number of highlights is controllable by limiting sentence selection. In contrast, adjusting summary length in an end-to-end, abstractive system can be difficult. Further, highlights are guaranteed to be true-to-the-original, while system abstracts can sometimes "hallucinate" facts and distort the original meaning. Our contributions in this work include the following:
• we introduce a new task formulation of creating sub-sentence summary highlights, then describe an annotation scheme to obtain binary sentence labels for extraction, as well as start and end indices to mark the most important textual unit of a positively labeled sentence;
• we examine the feasibility of using neural extractive summarization with a multi-term objective to identify summary sentences and their most informative sub-sentence units. Our study provides new benchmarks and baselines for highlighting at the sub-sentence level.

[Figure 1: an example sentence, "marseille, france (CNN) the french prosecutor leading an investigation into the crash of germanwings flight 9525 insisted wednesday that he was not aware of any video footage from on board the plane.", shown at successive stages (i)-(iii) of the annotation process.]

Annotating Sub-Sentence Highlights
We propose to derive gold-standard sub-sentence highlights from the human-written abstracts that often accompany documents (Hermann et al., 2015). A challenge remains, however, because abstracts are only loosely aligned with source documents and contain unseen words and phrases. We define a summary-worthy sub-sentence unit as the longest consecutive subsequence that contains content of the abstract. We obtain gold-standard labels for sub-sentence units by first establishing word alignments between the document and abstract, then smoothing word labels to generate sub-sentence labels.
Word Alignment The attention matrix of neural sequence-to-sequence models provides a powerful and flexible mechanism for word alignment. Let S = {w_i}_{i=1}^M be the sequence of words denoting the document, and T = {w_t}_{t=1}^N the abstract. The attention weight α_{t,i} indicates the amount of attention received by the i-th document word when generating the t-th abstract word. All attention values (α) are learned automatically from parallel training data. After the model is trained, we identify the single document word that receives the most attention for generating each abstract word, as denoted in Eq. (1) and illustrated by Figure 1 (i):

i_t = argmax_i α_{t,i}   (1)

This step produces a set of source words containing the content of the abstract, though possibly with distinct word forms.

Smoothing Our goal is to identify sub-sentence units containing content of the abstract by smoothing the word labels obtained in the previous step. We extract a single most informative textual unit from each sentence. As a first attempt, we obtain start and end indices of sub-sentence units using the following heuristics:
• connecting two selected words if there is a small gap (<5 words) between them. For example, in Figure 1 (ii), the gap between "crash" and "germanwings" is bridged by labelling all gap words as selected;
• choosing the longest consecutive subsequence after gap filling as the most important unit of the sentence. In Figure 1 (iii), we select the longest segment, containing 22 words. When a tie occurs, we choose the segment appearing first;
• creating gold-standard labels for sentences and sub-sentence units. If a segment is the most informative unit, i.e., the longest subsequence of a sentence and longer than 5 words, we record its start and end indices. If a segment is selected, its containing sentence is labelled 1, otherwise 0.
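The alignment-and-smoothing pipeline above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the boolean-mask representation, and the exact tie-breaking code are our own choices, while the thresholds (gap < 5 words, segment > 5 words) mirror the heuristics described in the text.

```python
import numpy as np

def align_and_smooth(attention, sent_bounds, gap=5, min_len=5):
    """Sketch of the annotation pipeline: argmax word alignment,
    gap filling, and longest-segment selection.

    attention: (N abstract words, M document words) attention matrix.
    sent_bounds: list of (start, end) word-index spans, one per sentence.
    Returns per-sentence (label, (seg_start, seg_end) or None).
    """
    # Step 1 (Eq. 1): each abstract word selects its most-attended doc word.
    selected = np.zeros(attention.shape[1], dtype=bool)
    selected[np.argmax(attention, axis=1)] = True

    results = []
    for s, e in sent_bounds:
        mask = selected[s:e].copy()
        # Step 2: bridge gaps of fewer than `gap` unselected words.
        idx = np.flatnonzero(mask)
        for a, b in zip(idx, idx[1:]):
            if b - a - 1 < gap:
                mask[a:b + 1] = True
        # Step 3: keep the longest consecutive run; first one wins ties.
        best = None
        i = 0
        while i < len(mask):
            if mask[i]:
                j = i
                while j < len(mask) and mask[j]:
                    j += 1
                if best is None or (j - i) > (best[1] - best[0]):
                    best = (i, j)
                i = j
            else:
                i += 1
        # Step 4: a sentence is positive iff its best run exceeds min_len.
        if best is not None and (best[1] - best[0]) > min_len:
            results.append((1, (s + best[0], s + best[1])))
        else:
            results.append((0, None))
    return results
```

For a 12-word document split into sentences (0, 8) and (8, 12), with abstract words attending most to document positions 0, 1, 2, 4, 5, 6, and 7, the one-word gap at position 3 is bridged and the first sentence yields a positive label with segment (0, 8), while the second sentence receives no selected words and is labelled negative.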

Dataset and Statistics
We conduct experiments on the CNN/DM dataset released by See et al. (2017), containing news articles and human abstracts. We choose the pointer-generator networks described in the same work to obtain the attention matrices used for word alignment. System outputs are compared against gold-standard highlights and human abstracts, respectively, to validate system performance.
In Table 1, we present data statistics of the gold-standard sub-sentence highlights. We observe that gold-standard highlights and human abstracts are of comparable length in terms of tokens. On average, 28% of document sentences are labelled as positive. Among these, 47% of the words belong to gold-standard sub-sentence highlights. In our processed dataset we retain important document-level information such as original sentence placement and document ID. We consider each document sentence as a data instance, and introduce a neural model to predict (i) a binary sentence-level label, and (ii) start and end indices of a consecutive subsequence for a positive sentence. We are particularly interested in predicting start and end indices to encourage sub-sentence segments to remain self-contained. Finally, we leverage the document ID to re-combine model output into summaries at the document level.
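The document-level re-combination step can be sketched as follows; the dictionary field names (`doc_id`, `position`, `label`, `text`) are illustrative stand-ins for the retained metadata described above, not the authors' actual schema.

```python
from collections import defaultdict

def recombine(predictions):
    """Group sentence-level predictions back into document-level
    summaries using the retained document ID and sentence placement.
    Each prediction is a dict; the field names here are illustrative."""
    docs = defaultdict(list)
    for p in predictions:
        if p["label"] == 1:                       # keep positive sentences only
            docs[p["doc_id"]].append(p)
    summaries = {}
    for doc_id, sents in docs.items():
        sents.sort(key=lambda p: p["position"])   # restore original order
        summaries[doc_id] = [p["text"] for p in sents]
    return summaries
```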

Models
We provide initial modeling for our data with a single state-of-the-art architecture. The purpose is to build meaningful representations that allow for joint prediction of summary-worthy sentences and their sub-sentence units. Our model receives as input an individual sentence, denoted S = {w_i^s}_{i=1}^M, where s is the sentence index in the original document. The model learns to predict the sentence label and the start/end index of a sub-sentence unit based on contextualized representations.
For each token w_i^s we build a combined representation from E_tok, E_s-pos, and E_d-pos, i.e., a token embedding, a sentence-level positional embedding, and a document-level positional embedding. Here s-pos denotes the token position within a sentence, d-pos denotes the sentence position within the document, and E(w_i^s) ∈ R^d. We justify the last embedding by noting that sentence position within the document plays an important role, since positive labels are generally more probable towards the beginning of a document. The final input representation is an element-wise addition of all embeddings (Eq. (2)):

E(w_i^s) = E_tok(w_i^s) + E_s-pos(i) + E_d-pos(s)   (2)

This input is encoded using a bi-directional transformer (Vaswani et al., 2017; Devlin et al., 2018), whose output we denote h.
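The element-wise addition of the three embeddings can be illustrated with plain NumPy; the vocabulary size, maximum positions, and dimension d below are toy values, not those used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                   # embedding dimension (illustrative)
E_tok = rng.normal(size=(100, d))       # token embeddings (toy vocab of 100)
E_s_pos = rng.normal(size=(50, d))      # within-sentence positions
E_d_pos = rng.normal(size=(30, d))      # sentence position in the document

def input_repr(token_ids, sent_index):
    """E(w_i^s) = E_tok(w_i^s) + E_s-pos(i) + E_d-pos(s), per Eq. (2)."""
    i = np.arange(len(token_ids))
    return E_tok[token_ids] + E_s_pos[i] + E_d_pos[sent_index]

# A 3-token sentence appearing third (index 2) in its document.
x = input_repr([4, 17, 9], sent_index=2)
```

Note that every token of a sentence shares the same E_d-pos vector, which is how the sentence's document position is injected into each token representation.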

Objectives
We use the transformer output to generate three labels: the sentence label and the start and end positions of the sub-sentence unit. First we obtain the sequence representation via the [CLS] token. We apply a linear transformation to this vector, followed by a softmax layer, to obtain a binary label for the entire sentence.
For the indexing objective we transform the encoder output h to produce start and end index classifications: a = MLP_start/end(h) ∈ R^{M×2}. Again we use a single linear transformation; here it is applied across the encoder temporally, giving each time-step two channels. The two channels are individually passed through a softmax layer to produce two distributions, one for the start index and one for the end index. Finally, we train end-to-end with a combined cross-entropy objective:

L = L_sent + λ(L_start + L_end)

For negatively labeled sentences, L_start and L_end are not utilized during training. λ is a coefficient balancing the two task objectives.
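A minimal sketch of this combined objective for a single training example follows. The cross-entropy helper and function signatures are our own reconstruction; the masking of the start/end terms for negative sentences and the λ weighting follow the text.

```python
import numpy as np

def cross_entropy(logits, target):
    # Numerically stable softmax cross-entropy for a single example.
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[target]

def combined_loss(sent_logits, start_logits, end_logits,
                  sent_label, start_idx, end_idx, lam=0.1):
    """L = L_sent + lam * (L_start + L_end); the indexing terms are
    masked out for negatively labeled sentences."""
    loss = cross_entropy(sent_logits, sent_label)
    if sent_label == 1:
        loss += lam * (cross_entropy(start_logits, start_idx)
                       + cross_entropy(end_logits, end_idx))
    return loss
```

With uniform logits, a negative sentence contributes only the sentence-classification term (log 2 for two classes), while a positive sentence additionally pays the weighted start and end terms.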

Experimental Setup
The encoder hidden state dimension is set to 768, with 12 layers and 12 attention heads (BERT_BASE uncased). We utilize dropout (Srivastava et al., 2014) with p = 0.1, and λ is empirically set to 0.1.

[Table 2: results compared against an abstractive summarizer (See et al., 2017) and an extractive summarizer (Arumae and Liu, 2019) whose CNN/DM results are macro-averaged. The bottom two sections showcase our models. We report results at the sentence and sub-sentence level, with and without E_d-pos embeddings (+posit.). These results are further broken down to reflect evaluation against human abstracts and our own gold-standard segments.]
We use Adam (Kingma and Ba, 2014) as our optimizer with a learning rate of 3e-5, and implement early stopping on the validation split. Devlin et al. (2018) suggest that fine-tuning takes only a few epochs with large datasets. Training was conducted on a GeForce GTX 1080 Ti GPU; each model took at most three days to converge, with a maximum epoch time of 12 hours. At inference time we extract start and end indices only when the sentence label is positive. Additionally, if the system produces an end index occurring before the start index, we ignore it and select the argmax of the end-index distribution among positions located after the start index.
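The inference-time constraint on span decoding can be sketched as follows; the function is our own illustration of the re-selection rule described above, operating on already-softmaxed probabilities.

```python
import numpy as np

def decode_span(start_probs, end_probs):
    """Take the argmax start index; if the argmax end index falls before
    it, re-select the best end among positions at or after the start."""
    start = int(np.argmax(start_probs))
    end = int(np.argmax(end_probs))
    if end < start:
        end = start + int(np.argmax(end_probs[start:]))
    return start, end
```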

Results
In Table 2 we report results on the CNN/DM test set, evaluated with ROUGE (Lin, 2004). We examine to what extent our summary sentences and sub-sentence highlights, annotated using the strategy presented in §2, match the content of human abstracts. These are the oracle results for sentences and segments, respectively. Although abstracts can contain unseen words, we observe that 70% of abstract words are covered by gold-standard sentences, and 51% are included in sub-sentence units, suggesting the effectiveness of our annotation method at capturing summary-worthy content.
We proceed by evaluating our method against state-of-the-art extractive and abstractive summarization systems. Arumae and Liu (2019) present an approach that extracts summary segments using question answering as a supervision signal, assuming a high-quality summary can serve as a document surrogate for answering questions. See et al. (2017) present pointer-generator networks, an abstractive summarization model that serves both as a state-of-the-art baseline and as a vital tool for guiding our data creation. We show that the performance of oracle summaries is superior to these baselines in terms of R-2, with sub-sentence highlights achieving the highest R-2 F-score of 31%, suggesting that extracting sub-sentence highlights is a promising direction moving forward.

Modeling
Our models are shown in the bottom two sections of Table 2. We obtain system-predicted whole sentences (Sent) and sub-sentence segments (Segm); then evaluate them against both human abstracts (ABSTRACT) and gold-standard highlights (SUB-SENT). We test the efficacy of document positional embeddings (Eq. (2)), denoted as +posit.
Using R-2 as the defining metric, our model outperforms or performs competitively with both the abstractive and extractive baselines. We find that document-level positional embeddings are beneficial: for both summary types, models with these embeddings have a competitive edge over those without. Notably, sub-sentence-level ROUGE scores consistently outmatch sentence-level values. These results are nontrivial, as segment-level modeling is highly challenging, often resulting in increased precision but drastically reduced recall (Cheng and Lapata, 2016).
Our model (+posit.) positively labels 22.27% of sentences, with an average summary length of 3.54 sentences. The segment model crops the selected sentences, exhibiting a compression ratio of 0.77. Compared to the gold-standard ratio of 0.47, this is a 67.4% increase, pointing to future work on highlighting sub-sentence segments.

Conclusion
We introduced a new task and dataset for studying sub-sentence highlight extraction. We have shown that the dataset provides a new upper bound for evaluation metrics, and that sub-sentence segments yield more concise summaries than full sentences. Furthermore, we evaluated our data with a state-of-the-art neural architecture to demonstrate its modeling capabilities.