Interpretable Multi-headed Attention for Abstractive Summarization at Controllable Lengths

Abstractive summarization at controllable lengths is a challenging task in natural language processing. It is even more challenging for domains where limited training data is available or scenarios in which the length of the summary is not known beforehand. At the same time, when it comes to trusting machine-generated summaries, explaining how a summary was constructed in human-understandable terms may be critical. We propose Multi-level Summarizer (MLS), a supervised method to construct abstractive summaries of a text document at controllable lengths. The key enabler of our method is an interpretable multi-headed attention mechanism that computes attention distribution over an input document using an array of timestep independent semantic kernels. Each kernel optimizes a human-interpretable syntactic or semantic property. Exhaustive experiments on two low-resource datasets in English show that MLS outperforms strong baselines by up to 14.70% in the METEOR score. Human evaluation of the summaries also suggests that they capture the key concepts of the document at various length-budgets.


Introduction
Great progress has been made in recent years on abstractive summarization of text documents. Among existing works, sequence-to-sequence networks with attention (Gehring et al., 2017; Liu et al., 2018a) have been clear front-runners. Being able to constrain the length of a summary while preserving its desirable properties has many real-world applications. One such application is content optimization for variable screen sizes. Online content creators such as news portals, blogs, and advertisement agencies with audiences on multiple platforms customize their content based on display area for the best experience. High variance in screen sizes often requires extensive human supervision to perform these modifications. However, there has not been much work on summarization at controllable lengths until recently.
As most sequence-to-sequence networks (Rush et al., 2015; Nallapati et al., 2016) do not enforce the length of a summary, scenarios such as the one above may require an ensemble of networks to cover all possible lengths. There are two major challenges in following this approach for real-world applications. First, training sequence-to-sequence networks is a resource-intensive task (Strubell et al., 2019). To train a network for generating summaries budgeted at length b, we need a parallel corpus of text documents and their gold-standard summaries at length b. Constructing a large enough corpus with summaries budgeted at b, ∀b ∈ (0, 1), may not be possible and/or cost-efficient for a number of domains. This is one of the main reasons why most existing works on abstractive summarization evaluate their models on large-scale news corpora (Nallapati et al., 2016; Hermann et al., 2015), leaving out a number of important but low-resource domains (Magooda and Litman, 2020; Parida and Motlicek, 2019) where the number of available training documents is limited. Second, the range of possible length-budgets R(b) may not always be known beforehand. In many scenarios, it can be known as late as run-time. Therefore, we formalize the summarization task addressed in this paper as follows.
Problem Definition: Given a document S of length N (tokens) and a maximum token budget of b, we aim to construct an abstractive summary s_b that satisfies the following conditions. C1: information redundancy is minimized in s_b; C2: coverage of the major topics of S is maximized in s_b; C3: the length of s_b is maximal within the specified budget b without adversely affecting conditions C1 and C2. C1 and C2 ensure that the properties of a high-quality summary are preserved in s_b, whereas C3 ensures that s_b is the largest possible summary that can be constructed within budget b without compromising its quality. Note that C1 and C2 become seemingly contradictory as the length of the summary increases. Our goal is to find the optimal tradeoff.

Input text: police are hunting a man aged between 50 and 60 suspected of robbing a bank in broad daylight and running off with £3,000 in cash. the robbery took place at 12.30pm at a lloyds bank branch in fairwater, cardiff, police said. detectives have issued cctv images of the suspect, who is 50 to 60, 5ft 9in to 6ft and was wearing black clothing. the white male suspect, who has greying black hair and wore glasses, was captured on camera inside the bank. detectives said no one was injured during the robbery and they were 'confident' the public would be able to identify the suspect. detective sergeant andy miles, from fairwater cid, said: 'inquiries are continuing to identify the culprit. the cctv is clear and i am confident that members of the public will know his identity... '. (truncated)
Summary at compression budget = 1/2: police are hunting a man aged between 50 and 60 suspected of robbing a bank in broad daylight and running off with £3,000 in cash. the robbery took place at 12.30pm at a lloyds bank branch in fairwater, cardiff, police said. the white male suspect, who has greying black hair and wore glasses, was captured on camera inside the bank. detectives have issued cctv images of the suspect, who is 50 to 60, 5ft 9in to 6ft and was wearing black clothing. detective sergeant andy miles, from fairwater cid, said: 'inquiries are continuing to identify the culprit.
Prototype summary: robbery took place at 12.30pm at a lloyds bank branch in fairwater , cardiff. detectives have issued cctv images of the suspect , who is 50 to 60. detective sergeant andy miles , from fairwater cid , said : ' inquiries are continuing to identify the culprit.
Figure 1: MLS expands the highlighted sentences in the prototype summary to the boldfaced tokens in the input text to construct a summary budgeted at half the length of the input text.
Early works on incremental summarization (Buyukkokten et al., 2001; Yang and Wang, 2003) leveraged structural tags supported by document markup languages to generate summaries at various lengths, thus imposing a serious constraint on the document formats (e.g. XML, HTML) that come under the purview of such methods. Incremental sampling of sentences based on a salience score (Otterbacher et al., 2006; Campana and Tombros, 2009) can partially solve this problem by constructing extractive summaries of the input document. We show in Section 3 that these sampling-based methods often fail to preserve the desirable properties of a high-quality summary. Among recent works, Kikuchi et al. (2016) were the first to propose a supervised method for controlling length during abstractive summarization. Their work was later extended by introducing the length of a summary as an input to the network. However, instead of the exact length, that extension approximates the length to a predefined value-range and fails to adhere to the allocated budget in a number of cases. Liu et al. (2018b) address this issue by proposing a convolutional encoder-decoder network that introduces the desired summary length as an input to the initial state of the decoder. We compare against it and report its performance on two datasets in our experimental setup in Section 3.
Unfortunately, when it comes to interpreting these models, i.e. how the summaries came to be, the answer still remains elusive. Explaining how a machine-generated summary was constructed has become a necessity under the newly introduced General Data Protection Regulation (ITGP, 2017), especially for applications in the enterprise (Sarkhel and Nandi, 2019; Keymanesh et al., 2020) and biomedical domains (Moradi and Ghadiri, 2018; Sarkhel et al., 2018). Some recent efforts have proposed using interpretable heatmaps (Baan et al., 2019) generated from the attention distribution over an input sequence for interpreting model behaviour. However, they are still quite limited (Jain and Wallace, 2019) in consistently explaining all aspects of a neural summarizer. This leaves a gap in the ongoing efforts (Song et al., 2020a; Song et al., 2020b) to generate abstractive summaries that are guided by human-interpretable semantic/syntactic qualities.
Briefly, the main goal of an attention mechanism in an encoder-decoder network is to assign a softmax score to every encoder hidden state (based on its relevance to the token being decoded) and amplify those that are assigned high scores through a weighted average. Source-target attention (Nallapati et al., 2016) relies on another sequence for computing these scores, whereas self-attention (Vaswani et al., 2017; Paulus et al., 2018) operates over the elements in the current input sequence. A multi-headed attention mechanism allows a neural model to speed up training by enabling parallelization across timesteps. The number of operations in the computation of self-attention, however, scales quadratically with input length, making it a computationally expensive operation for long input sequences. Training such a network for a summarization task would require a large parallel corpus of input documents and their corresponding gold-standard summaries budgeted at b.
The role of some of the attention-heads during abstractive summarization is also not transparent (Baan et al., 2019). To address these issues, we replace self-attention with a lightweight, interpretable alternative. Instead of projecting each input sequence multiple times at every timestep, we encode an input sequence only once, using a timestep-independent kernel (Q) learned in an unsupervised or distantly supervised way from the input document. Each kernel has a human-interpretable syntactic/semantic role. Every attention-head in this multi-headed mechanism computes an attention distribution over the input sequence using a unique kernel Q_i, recycling it at every timestep. Compared to self-attention, our proposed attention mechanism scales linearly with the input sequence length and uses significantly fewer trainable parameters. As we will show in Section 3, this allows us to train our network on limited training samples in low-resource datasets.
In this paper, we propose MLS, a supervised method to generate abstractive summaries at arbitrary lengths. It computes a length-constrained summary s_b budgeted at length b by soft-switching between a copy and an expand operation over a prototype summary s_p constructed from the document.
The key enabler in this process is an interpretable, multi-headed attention mechanism. We develop a length-aware encoder-decoder network, called the Pointer-Magnifier network, that leverages this attention mechanism to construct summaries within a specified length. We train our network on limited training samples from two cross-domain datasets: the MSR-Narrative (Ouyang et al., 2017) and Thinking Machines (Brockman, 2018) datasets. Exhaustive evaluation on a range of success metrics shows that MLS performs competitively or better against strong baseline methods. Subsequent human evaluation of summaries generated by MLS suggests that they accurately capture the main concepts of the input document. To summarize, some of the major contributions of this work are as follows:
• We propose MLS, a supervised approach to generate abstractive summaries of a text document at controllable lengths.
• We develop a length-aware encoder-decoder network that leverages an interpretable, multi-headed attention mechanism to construct length-constrained summaries.
• Experimental results on two cross-domain datasets show that, trained on limited training samples, MLS generates summaries that are coherent and capture the key concepts of a document.

Proposed Methodology
MLS constructs a length-constrained summary of a document in two steps. First, it derives a prototype summary s_p from the document, covering its major concepts. Then, it expands or shortens s_p, depending on the length-budget, to create the final summary. We employ an encoder-decoder network at each step. For the first step, we extend the PG-Network (See et al., 2017). For the second step, we develop a length-aware encoder-decoder network. We describe both steps in greater detail in the following sections.

Generating the Prototype Summary
We extend the PG-Network of See et al. (2017) to construct the prototype summary s_p of a document. We tokenize the document and feed it to the encoder network sequentially. As the encoder hidden states are updated, the decoder network constructs the prototype summary one token at a time by soft-selecting between tokens in the input document and an external vocabulary. The decoding process is guided by an attention distribution computed over the input document and the external vocabulary. An overview of this network is shown in Fig. 2. We refer readers to See et al. (2017) for more background on this network. An example prototype summary is shown in Figure 1. Contrary to existing prototype-text guided summarization methods (Liu et al., 2019; Saito et al., 2020), we do not specify the length of the prototype summary as an input to the network; rather, we infer it by outputting tokens until the EOS token is produced. We discuss the training and parameter settings of the network used in our experiments in Section 2.3. It is worth mentioning here that one of the main reasons for selecting the PG-Network as our architecture of choice for this step is its capability to construct a summary by looking up a learned language model. Other networks with similar capabilities can also be used, as this step has a transitive effect on the next phase of our approach.

Constructing the Length-Constrained Summary
To construct a summary within length-budget b, we develop the Pointer-Magnifier network: a length-aware, interpretable encoder-decoder network. An overview of the network is shown in Fig. 2. It consists of a multiplex layer, an encoder layer (yellow rectangles) and a decoder layer (green rectangles). The encoder layer takes the prototype summary constructed in the previous step as input. The decoder layer outputs the final summary. We describe each layer in detail below.
A. The Multiplex Layer and Interpretable Kernels: In an effort to build a transparent network, we embody in our network three qualitative properties that are associated with a high-quality summary. A high-quality summary (1) maximizes the coverage of the major topics (Φ_1) and (2) keywords (Φ_2) appearing in the input document, while (3) minimizing the amount of redundant information (Φ_3). We encode each property using a semantic kernel (Q_i), learned in an unsupervised or distantly supervised way from the input document itself. Every kernel plays a unique, human-interpretable syntactic/semantic role in constructing the final summary. One of the key components in this process is the multiplex layer M. Physically, it is a nested matrix of dimensions 3 × 3 shared between the encoder and decoder layers. Each row in M contains a triplet (Q_i, dist_i, w_i): a semantic kernel Q_i, a distance metric dist_i, and a weight w_i. During inference, each of these kernels measures the contribution of every sentence in the prototype summary towards optimizing one of the properties Φ_i, 1 ≤ i ≤ 3, mentioned above. w_i represents the relative weight assigned to the property Φ_i in constructing the final summary. We compute the kernels as a preprocessing step.
Defining the Kernels: To encode the property Φ_1, we define Q_1 as a matrix of dimensions 3 × 300, where each row of Q_1 represents one of the three most dominant topic vectors of the input document as a 300-dimensional vector. We use an unsupervised LDA-based model (Blei et al., 2003) to derive these topic vectors. Symmetric KL-divergence is used as the distance metric (dist_1). Similarly, we encode the property Φ_2 as a one-dimensional vector Q_2 of length 50, where each vector component represents the relative frequency of one of the 50 most frequent keywords in the input document. We use RAKE (Rose et al., 2010), a publicly available library, to identify the keywords of a document. Symmetric KL-divergence is used as the distance metric (dist_2). Finally, we encode Φ_3 as a matrix Q_3 of dimensions p × 300, where the i-th row of Q_3 represents an embedding of the i-th sentence in the input document, so that p is the number of sentences in the document. We compute the embedding vector of each sentence using a model (Le and Mikolov, 2014) pretrained on the English Wikipedia corpus. Cosine similarity is used as the distance metric (dist_3). Our choice of unsupervised/distantly supervised kernels reflects our motivation (see Section 1) to leverage a limited number of training samples from the experimental dataset to construct the final summary. We discuss the role played by each semantic kernel (Q_i), distance metric (dist_i), and weight (w_i) in constructing the final summary from s_p in the following section.
B. The Encoder Layer: Each encoder-block (see Fig. 3) contains an embedding layer and a local-attention layer. At every timestep t, a sentence from s_p is fed into the embedding layer of each of the three encoder-blocks. The embedding layer computes a fixed-length embedding (V_i) of the sentence and propagates it to the local-attention layer. Each encoder-block in our network is mapped to a unique triplet (Q_i, dist_i, w_i) in the multiplex layer. To compute the local-attention (c_i) attributed to a sentence in s_p by the i-th encoder-block, we embed it in the same semantic space as Q_i and compute its distance from Q_i in that encoding space (Eq. 1).
In Eq. 1, Q_i represents a kernel of dimensions r × n_i and V_i represents an embedding vector of length n_i. The embedding layer represents each sentence in s_p in the same encoding space as the kernel Q_i associated with that block. We compute the local-attention c_i by taking a column-wise average of the distance-matrix C_{t,i} (Eq. 2). The kernel Q_i is reused for all the sentences fed to the i-th encoder-block. The distribution [c_1, c_2, ...] obtained this way is then normalized to derive the local-attention distribution C_i over s_p. The final attention distribution (A) over s_p at timestep t is computed by normalizing the weighted average (Eq. 3) of the local-attention distributions computed by each attention-head.
It is worth noting here that pairing each encoder-block with a distinct attention-head ensures that there is a dedicated pathway to compute local attentions for every encoder-block. This allows us to parallelize the network and speed up the decoding process when constructing the final summary.
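Eqs. 1 to 3 are referenced but not reproduced in this text, so the following is only a minimal sketch of the computation as described in prose: one score per sentence per head (the average distance to the kernel rows), a per-head normalization, and a weighted combination across heads. It makes several assumptions that are not from the paper: cosine similarity stands in for every `dist_i`, the per-head scores are sum-normalized, the final normalization is a softmax, and all function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def local_attention(kernel, sentence_vecs, dist=cosine):
    """One attention-head: score every sentence against a fixed kernel.

    kernel: r rows of length n_i; sentence_vecs: one length-n_i vector per sentence.
    """
    # c_j = average of dist(row, V_j) over the r kernel rows
    raw = [sum(dist(row, v) for row in kernel) / len(kernel) for v in sentence_vecs]
    total = sum(raw)
    # Sum-normalization assumes non-negative scores (true for the toy data below).
    return [c / total for c in raw] if total > 0 else [1.0 / len(raw)] * len(raw)

def final_attention(local_dists, weights):
    """Weighted average across heads, normalized with a softmax (an assumption)."""
    n = len(local_dists[0])
    avg = [sum(w * d[j] for w, d in zip(weights, local_dists)) / len(weights)
           for j in range(n)]
    peak = max(avg)
    exps = [math.exp(a - peak) for a in avg]
    z = sum(exps)
    return [e / z for e in exps]

# Toy example: two heads over three sentences embedded in 2-D.
sentences = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
head_a = local_attention([[1.0, 0.0]], sentences)              # positively weighted head
head_b = local_attention([[0.0, 1.0], [1.0, 1.0]], sentences)  # negatively weighted head
attention = final_attention([head_a, head_b], weights=[1.0, -0.5])
```

Because each kernel is fixed before decoding, the per-head scores depend only on the sentence embeddings, which is what makes the cost linear in the number of sentences in this sketch.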
C. The Decoder Layer: Similar to the encoder, the decoder layer also consists of three decoder-blocks stacked in parallel. Each decoder-block contains an embedding layer and a local-attention layer. Parameters of the i-th encoder-block and i-th decoder-block are shared. We construct a length-constrained summary s_b of the input document by processing each sentence in s_p sequentially. Depending on the remaining length-budget at each timestep, the final summary is constructed by soft-switching between a copy and an expand operation. This process is guided by a sentence-level attention distribution (Eq. 3) computed over s_p. If the copy operation is selected, a sentence from s_p is copied into the final summary, whereas the expand operation replaces a sentence in s_b with similar content from the input document. The original ordering of sentences is preserved.
The Copy Operation: The probability of copying into s_b a sentence s from the prototype summary that has not been included in the final summary (s_b) up to timestep t is defined as P_c(s) = A_t[s], where A_t represents a sentence-level attention distribution over s_p at timestep t. Initialized as A* (Eq. 3), the attention distribution is updated at each timestep after a copy or expand operation. If s* = argmax(P_c(s)) represents the sentence copied into s_b at timestep t, we update the attention distribution by zeroing out the probability of s* in A_t and renormalizing the resulting distribution.
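The update described above (pick the argmax, zero it out, renormalize) can be sketched in a few lines. The function name `copy_step` and the list-based representation of the attention distribution are illustrative assumptions.

```python
def copy_step(attention):
    """Select the highest-probability sentence, then zero it out and renormalize.

    attention: list of probabilities over the sentences of the prototype summary.
    Returns (index of the copied sentence, updated attention distribution).
    """
    best = max(range(len(attention)), key=lambda j: attention[j])
    remaining = [0.0 if j == best else p for j, p in enumerate(attention)]
    total = sum(remaining)
    # If nothing is left to copy, the distribution stays all zeros.
    updated = [p / total for p in remaining] if total > 0 else remaining
    return best, updated

copied, attention = copy_step([0.5, 0.3, 0.2])
```

Zeroing the copied position guarantees that a sentence is never copied twice, while renormalization keeps the remaining mass a valid distribution for the next timestep.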
The Expand Operation: If the length of our prototype summary (s_p) is less than the length-budget b, MLS can choose to expand a set of sentences from s_p. For each sentence s ∈ s_p, we define its expansion-set E(s) as the sentence n-gram that is most similar to s in the input document. We determine the expansion-set E(s) of a sentence s by using beam-search over all n-grams in the input document that are yet to be included in the final summary. The search objective is to maximize score(E) = sim(s, E) × overlap(s, E). The first term in score(E) denotes the average pairwise cosine similarity between s and the sentences in E(s), whereas the second term denotes the fraction of tokens in s that appear in E(s). To minimize across-sentence repetitions in the summary, the top 4 candidates identified by the search process are re-ranked (Chen and Bansal, 2018) based on the number of word bigrams and trigrams that would be repeated if the expansion-set were included in the final summary. We obtained the best performance by initializing n to 3 and changing it to 2 at later iterations of the decoding process. If v^k_i denotes the embedding-vector of the k-th sentence in E(s) computed by the embedding-layer of the i-th decoder-block, we define the probability of expanding a sentence s from the prototype summary to E(s) in the final summary as follows.
In Eq. 4, Q_i denotes the semantic kernel shared between the i-th encoder-block and decoder-block. We compute the probability of including the k-th sentence of E(s) in the final summary by first computing its contribution (c^e_{i,k}) towards optimizing the qualitative property Φ_i encoded by Q_i (Eq. 5). Repeating this process for all the sentences in E(s), followed by normalization, provides us with the distribution c^e_i = (c^e_{i,1}, c^e_{i,2}, ...). Here, c^e_i represents the probability distribution over E(s). To obtain the expansion probability of a sentence in E(s), we repeat this process for all 3 attention-heads and average the results (Eq. 6). The probability P_e(s) of expanding a sentence s from the prototype summary is obtained by averaging the expansion probability of all sentences in E(s). Once a sentence s has been expanded into the final summary, we update the attention distribution by zeroing out the probability at s and renormalizing the resulting distribution.
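To make the search objective score(E) = sim(s, E) × overlap(s, E) concrete, the sketch below scores a single candidate expansion-set. It assumes sentence embeddings are already available as plain lists of floats; the function and argument names are hypothetical, and the beam-search and re-ranking steps are not shown.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def expansion_score(s_tokens, s_vec, cand_tokens, cand_vecs):
    """score(E) = sim(s, E) * overlap(s, E) for one candidate expansion-set E.

    s_tokens / s_vec: tokens and embedding of the prototype sentence s.
    cand_tokens / cand_vecs: token lists and embeddings of the sentences in E.
    """
    # sim: average cosine similarity between s and each sentence in E
    sim = sum(cosine(s_vec, v) for v in cand_vecs) / len(cand_vecs)
    # overlap: fraction of tokens in s that appear somewhere in E
    cand_vocab = {tok for sent in cand_tokens for tok in sent}
    overlap = sum(1 for tok in s_tokens if tok in cand_vocab) / len(s_tokens)
    return sim * overlap

score = expansion_score(
    ["police", "hunt", "man"], [1.0, 0.0],
    [["police", "hunt", "suspect"], ["a", "bank", "robbery"]],
    [[1.0, 0.0], [1.0, 0.0]],
)
```

The product form means a candidate must be both semantically close to s and lexically anchored to it; a candidate that shares no tokens with s scores zero regardless of its embedding similarity.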

Soft-Selection between Copy and Expansion:
We define the probability p_o(s) of selecting between the copy and expand operation for a sentence s in the prototype summary as follows.
In Eq. 8, s_b* denotes the partially constructed summary up to timestep t. If the length-budget b is smaller than the length of the prototype summary s_p, the probability of including a sentence from s_p in the final summary depends on the attention distribution A_t over sentences in s_p that are not included in the final summary up to timestep t. In all other scenarios, α acts as a soft-switch between copying and expanding a sentence in s_p. A sentence can be expanded only if doing so does not exceed the length-budget. Once the probability of each sentence (and/or its expansion-set) has been computed, the decoder attends to the position with the highest probability and copies/expands it into the final summary. Generation stops once len(s_b*) reaches b. We observed that the probability of expanding a sentence from the prototype summary (instead of copying it) increases with the allocated length-budget.
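Eq. 8 is not reproduced in this text, so the soft-switch α cannot be shown faithfully. The following is a deliberately simplified sketch of the surrounding control flow only: at each step it picks the feasible copy or expand action with the highest score, never exceeds the budget, and preserves the original sentence order. The copy and expand scores are passed in as plain numbers, the attention renormalization between steps is omitted, and the function name `build_summary` is hypothetical.

```python
def build_summary(sentences, expansions, p_copy, p_expand, budget):
    """Greedy, budget-constrained choice between copying and expanding.

    sentences[j]  : tokens of the j-th prototype sentence
    expansions[j] : tokens of its expansion-set
    p_copy[j], p_expand[j] : scores for the two actions (assumed given)
    budget : maximum number of tokens in the final summary
    """
    chosen = {}                       # sentence index -> tokens placed in the summary
    used = 0
    remaining = set(range(len(sentences)))
    while remaining and used < budget:
        actions = []
        for j in remaining:
            # An action is feasible only if it keeps the summary within budget.
            if used + len(sentences[j]) <= budget:
                actions.append((p_copy[j], j, sentences[j]))
            if used + len(expansions[j]) <= budget:
                actions.append((p_expand[j], j, expansions[j]))
        if not actions:
            break
        _, j, tokens = max(actions, key=lambda a: a[0])
        chosen[j] = tokens
        used += len(tokens)
        remaining.discard(j)
    # Emit sentences in their original order.
    return [tok for j in sorted(chosen) for tok in chosen[j]]

proto = [["a", "b"], ["c", "d"]]
expanded = [["a", "b", "x", "y"], ["c", "d", "z"]]
summary = build_summary(proto, expanded, [0.6, 0.4], [0.7, 0.3], budget=6)
```

With a budget of 6 tokens the first sentence is expanded and the second is copied; with a budget of 2 only a copy of the first sentence fits, which mirrors the observation that expansion becomes more likely as the budget grows.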

Training the Networks
We trained the PG-Network and the Pointer-Magnifier network separately on an NVIDIA Titan-XP GPU with a batch size of 16. We pretrained the PG-Network on the CNN-DailyMail dataset (Nallapati et al., 2016) and then fine-tuned it on training samples of our experimental datasets. Using the evaluation script provided by Nallapati et al. (2016), we obtained a training set of 287,226 pairs and a validation set of 13,368 pairs for this dataset. All encoder-decoder weights were allowed to be updated during the fine-tuning stage, following an L1-transfer (Pan and Yang, 2009) of weights from the pretrained network. The external vocabulary used in both the pretraining and fine-tuning stages consisted of the 80K most frequent tokens in the training samples of the CNN-DailyMail dataset, our experimental dataset, or both. The learning rate and initial accumulator value were set to 0.15 and 0.1 respectively. We used Adagrad (Duchi et al., 2011) to train the network. The encoder was fed a maximum of 400 tokens and the decoder generated 100 tokens during pretraining. These values were increased to 500 and 200 respectively during fine-tuning. To prevent overfitting, we stopped training after 3000 iterations during the fine-tuning stage. With respect to the Pointer-Magnifier network, we learn the optimal values of w_i, 1 ≤ i ≤ 3, associated with each attention-head by grid-searching over the interval [-1, 1] with the objective of maximizing the ROUGE-1 score on the validation set. The optimal weights assigned to the attention-heads corresponding to topic-coverage (Φ_1) and keyword-coverage (Φ_2) were positive, whereas information redundancy (Φ_3) was assigned a negative weight for both of our datasets.

Experiments
We seek to answer three key questions in our experiments. Given a length-constrained summary s_b, (a) how similar is s_b to a gold-standard summary, (b) is it coherent and representative of the input document, and (c) how abstractive is s_b?
We answer the first two questions by evaluating the summaries generated by MLS over a range of success metrics on datasets belonging to two low-resource domains. We also conduct a user study to measure how representative the summaries are with respect to the input documents. A representative summary covers the main topics of the document. We answer the third question by computing the percentage of n-grams in s_b that do not appear in the input document and/or are generated from the external vocabulary.
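The abstractiveness measure described above (the percentage of summary n-grams absent from the input document) can be sketched as follows. Whitespace tokenization and the function names are assumptions made for illustration.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def novel_ngram_pct(summary_tokens, document_tokens, n):
    """Percentage of n-grams in the summary that do not occur in the document."""
    summary_grams = ngrams(summary_tokens, n)
    if not summary_grams:
        return 0.0
    document_grams = set(ngrams(document_tokens, n))
    novel = sum(1 for g in summary_grams if g not in document_grams)
    return 100.0 * novel / len(summary_grams)

document = "the robbery took place at a bank".split()
summary = "the robbery happened at a bank".split()
unigram_novelty = novel_ngram_pct(summary, document, 1)
bigram_novelty = novel_ngram_pct(summary, document, 2)
```

A fully extractive summary scores 0 for every n under this measure, so higher values indicate that more of the summary was generated rather than copied.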
A. Datasets: We evaluate MLS on two publicly available datasets from two low-resource domains: the MSR-Narrative (Ouyang et al., 2017) (D1) dataset and the Thinking-Machines (Brockman, 2018) (D2) dataset. The MSR-Narrative dataset contains personal stories shared by users on a social networking website. The Thinking-Machines dataset, on the other hand, contains position papers on a popular scientific topic published on an educational website. Each document in both datasets is paired with a gold-standard summary. We randomly selected 25% of the document-pairs to construct the training set and 10% of the document-pairs to construct a validation set for both datasets. The rest comprised the test corpus. We present an overview of some of the important properties of both datasets in Table 1.

B. Metrics: We compare the summaries constructed by MLS against gold-standard summaries using METEOR (Banerjee and Lavie, 2005) and ROUGE (Lin, 2004) scores. The average F1 scores of the ROUGE-1, ROUGE-2 and ROUGE-L metrics obtained for both datasets are shown in Table 2. To measure the representativeness of a summary, we compute the average KL-divergence score between the top-3 topic vectors of a summary and its input document. Following Srinivasan et al. (2018), we measure the coherence of a summary by computing the average cosine similarity between consecutive sentences. We report the absolute difference between the coherence score computed for a summary and its input document in Table 3. We also report the KL-divergence score between the sentiment vectors of a summary and the input document to check for potential biases in its polarity distribution. We used a publicly available library (Hutto and Gilbert, 2014) to derive the sentiment vectors. Note that lower values of the ∆Coherence and KL-divergence scores are desirable for a high-quality summary.

Table 2: ROUGE and METEOR scores of the budgeted summaries constructed by MLS (highlighted column) and the baseline methods for the MSR-Narrative (D1) and Thinking Machines (D2) datasets.
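The coherence and divergence measures described under Metrics can be sketched as below. This assumes sentence embeddings and topic/sentiment distributions are already available as plain lists; the small smoothing constant `eps` and the pairing of topic vectors by position are assumptions, and all function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def coherence(sentence_vecs):
    """Average cosine similarity between consecutive sentence embeddings."""
    pairs = list(zip(sentence_vecs, sentence_vecs[1:]))
    if not pairs:
        return 0.0
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

def delta_coherence(summary_vecs, document_vecs):
    """Absolute difference between summary and document coherence (lower is better)."""
    return abs(coherence(summary_vecs) - coherence(document_vecs))

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions; eps avoids log(0) (an assumption)."""
    return sum(a * math.log((a + eps) / (b + eps)) for a, b in zip(p, q))

def mean_topic_divergence(summary_topics, document_topics):
    """Average KL-divergence over topic vectors paired by position (an assumption)."""
    pairs = list(zip(summary_topics, document_topics))
    return sum(kl_divergence(p, q) for p, q in pairs) / len(pairs)

doc_vecs = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
sum_vecs = [[1.0, 0.0], [1.0, 0.0]]
gap = delta_coherence(sum_vecs, doc_vecs)
```

In this toy case the document coherence is 0.5 and the summary coherence is 1.0, so the gap is 0.5; a summary that mirrors the document's sentence-to-sentence flow would drive the gap towards zero.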
C. Baselines: We compare MLS against three baseline methods. Two of them follow a sampling-based approach, while our final baseline method employs a convolutional network to construct length-budgeted summaries. Our first baseline (A1) follows a systematic sampling-based approach to construct length-controlled summaries. Initialized with a randomly selected sentence from the first k-1 sentences of the input document, it constructs the final summary by including every k-th sentence from the last sampled position. We set k = 3 in all of our experiments for both datasets. Sampling terminates when the budget limit is exceeded or the end of the document is reached. Our second baseline method (A2) follows a weighted graph-based sampling strategy to construct budgeted summaries. It represents each sentence in the input document as a node in an undirected, complete, weighted graph. The weight assigned to an edge in this graph is equal to the pairwise cosine similarity between the nodes it connects. To construct the budgeted summary, we sample the top-K nodes of this graph using a weighted PageRank algorithm (Mihalcea and Tarau, 2004). Sampling stops when the budget is reached. Our third and final baseline method (A3) is a convolutional approach proposed by Liu et al. (2018b). It is a sequence-to-sequence network with Gated Linear Units that takes the desired length of a summary as an additional input to the initial state of the decoder network. Similar to our training protocol, we pretrain this network on the CNN-DailyMail dataset first and fine-tune it on the training samples from both of our experimental datasets. We allowed all weights to be updated during the fine-tuning phase.
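The systematic sampling baseline (A1) is simple enough to sketch directly. The version below assumes whitespace tokenization and stops before a sentence that would push the summary over the budget; the function name, the seeded random generator, and that stopping rule are illustrative choices rather than details confirmed by the text.

```python
import random

def systematic_sample(sentences, k, budget, rng=None):
    """Pick a random start among the first k-1 sentences, then every k-th sentence.

    sentences: list of sentence strings; budget: maximum number of tokens.
    """
    rng = rng or random.Random(0)
    # Random start position within the first k-1 sentences (at least position 0).
    start = rng.randrange(max(1, min(k - 1, len(sentences))))
    summary, used = [], 0
    for idx in range(start, len(sentences), k):
        n_tokens = len(sentences[idx].split())
        if used + n_tokens > budget:
            break                      # stop before exceeding the budget
        summary.append(sentences[idx])
        used += n_tokens
    return summary

doc = [f"s{i} x" for i in range(10)]   # ten two-token sentences
sampled = systematic_sample(doc, k=3, budget=6)
```

Because the sampler ignores content entirely, it serves as a lower bound against which salience-based and abstractive methods can be compared.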

Results and Discussion
We report the performance of all competing methods at five length-budgets. We specify the length-budget for constructing a summary as the product of the number of tokens in the input document and a compression-budget c ∈ {1/32, 1/16, 1/8, 1/4, 1/2}. Results from our experiments are presented in Tables 2 and 3. The best performance achieved for each metric is boldfaced. We highlight some of our key findings below.
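As a small worked example of the budget definition above, the token budget for each compression-budget is the document length multiplied by c. Rounding down to a whole number of tokens is an assumption here, as is the function name.

```python
from fractions import Fraction

# The five compression-budgets used in the experiments.
COMPRESSION_BUDGETS = [Fraction(1, d) for d in (32, 16, 8, 4, 2)]

def length_budgets(n_tokens):
    """Token budget for each compression-budget, rounded down (an assumption)."""
    return [int(n_tokens * c) for c in COMPRESSION_BUDGETS]

budgets = length_budgets(640)
```

For a 640-token document this yields budgets of 20, 40, 80, 160 and 320 tokens.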

Qualitative Evaluation at five Compression Budgets
In general, the abstractive methods (MLS and A3) outperform sampling-based approaches (see Table 2) on both datasets. MLS performs consistently well on all budgets, although performance is relatively better on smaller budgets. We obtain absolute improvements of 4.34% and 4.65% in ROUGE-1 score and 1.61% and 5.17% in METEOR score over the convolutional baseline (A3) for datasets D1 and D2, respectively, at compression budget = 1/32. At higher budgets, our performance was comparable with A3. In terms of coherence, MLS performs comparably to or better than A3 (see Table 3). A smaller ∆Coherence score than A1 and A2 suggests that MLS generated more coherent summaries than these two baseline methods. The small KL-divergence between the topic distribution of a budgeted summary and the input document shows that MLS-generated summaries are representative of the document for both datasets. In fact, topic-coverage in summaries generated by MLS is at least 75% better than the convolutional baseline (A3) (Liu et al., 2018b), although performance becomes comparable at larger budgets as more sentences from the prototype summary are expanded to make the final summary. MLS outperforms A1 and A2 in terms of staying true to the sentiment distribution of the input document. This can be seen from the small KL-divergence scores obtained for the sentiment distribution achieved by MLS in Table 3. MLS-generated summaries were more abstractive at higher budgets (Fig. 4). At compression budget = 1/2, 27.35% of the tokens in the summaries constructed for dataset D1 and 8.75% of the tokens for dataset D2 were contributed by the external vocabulary.

Table 3: Coherence and completeness of the budgeted summaries constructed by MLS (highlighted column) and the baseline methods for the MSR-Narrative (D1) and Thinking Machines (D2) datasets.

Ablative Analysis
To investigate the effects of pretraining on end-to-end results, we compare the ROUGE-1 score of summaries constructed by MLS against an ablative baseline MLS*. It is identical to MLS except that the PG-Network was not pretrained. In our second experiment, we compare MLS against MLS+, an ablative baseline that constructs the prototype summary following a greedy heuristic (Otterbacher et al., 2006) instead of the PG-Network. MLS outperforms both baselines (Fig. 5) on both datasets, thereby establishing that using the PG-Network in our framework and pretraining it on the CNN-DailyMail dataset improved the quality of our final summaries. Finally, to investigate the effects of the semantic kernels introduced in the Pointer-Magnifier network, we iteratively replaced each of the three semantic kernels (Section 2.2) with a randomized kernel obtained by shuffling its rows and columns. We observed an absolute decrease of up to 4.30% in ROUGE-1 score and 3.75% in METEOR score for Q_3, with bigger impacts on performance at higher length-budgets. Replacing Q_2 with a randomized kernel, on the other hand, decreased the average ∆Coherence score by approximately 45% for dataset D1 and 30% for D2 for summaries constructed at compression budget = 1/2, i.e. half the length of the input document.
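The kernel randomization used in this ablation (shuffling the rows and columns of a kernel) can be sketched as follows. Representing a kernel as a list of equal-length rows and the function name `randomize_kernel` are assumptions; the exact shuffling procedure used in the experiments is not specified in the text.

```python
import random

def randomize_kernel(kernel, seed=0):
    """Return a copy of the kernel with its rows and its columns permuted.

    kernel: list of equal-length rows. The values are preserved; only their
    positions change, which destroys the learned structure.
    """
    rng = random.Random(seed)
    rows = list(range(len(kernel)))
    cols = list(range(len(kernel[0])))
    rng.shuffle(rows)
    rng.shuffle(cols)
    return [[kernel[r][c] for c in cols] for r in rows]

original = [[1, 2, 3], [4, 5, 6]]
shuffled = randomize_kernel(original)
```

Since the same column permutation is applied to every row, each shuffled row remains a permutation of one original row, so the kernel keeps its shape and value distribution while losing its alignment with the embedding dimensions.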

Human Evaluation of Length-Controlled Summaries
We conducted a study to evaluate the completeness of the summaries constructed by MLS. More specifically, we considered a scenario where the user needs to complete a fact-checking task. We chose three documents from both datasets randomly and asked each participant to verify the presence of some key facts of the document in the summaries constructed by MLS and/or a baseline method. Each participant was instructed to complete the task solely based on the content of the summary, without relying on any previous knowledge. For example, the question "Does the story tell us why the narrator was fired?" was paired with the following summary: "I tried to return a lost wallet to a customer who accused me of stealing it and then grabbed my hair. We got in a physical fight and I was fired from my job". The participants had to choose between "Yes", "No", and "More information required". If a participant selected the third option, a longer summary was shown with the same question. The task was terminated otherwise. In addition to MLS, A2 (the stronger extractive baseline in our experimental setup) and A3, we add two extreme settings: (a) the full-content setting, in which the original document was shown, and (b) the no-content setting, in which no textual content (other than the question itself) was shown to a participant. The full-content setting ensured that the question could indeed be answered from the article, whereas the no-content setting checked whether the questions contained any hint about the answer.

Table 4: Mean accuracy and completion time using MLS, A2, A3, No-content (NC) and Full-content (FC) settings

The task started by showing each participant a summary generated at compression budget = 1/32. If they opted for more information to be shown, we provided a summary generated by the same method, doubling the compression budget each time until the user responded with "Yes" or "No" or we reached the budget of 1/2.
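The budget-doubling protocol described above can be sketched as a short loop. This is a minimal sketch, not the software used in the study: `budget_schedule`, `run_trial`, and the `respond` callback (which stands in for a participant) are hypothetical names introduced for illustration.

```python
from fractions import Fraction

MORE = "More information required"


def budget_schedule(start=Fraction(1, 32), stop=Fraction(1, 2)):
    """List the compression budgets shown, doubling from start to stop."""
    budgets = []
    budget = start
    while budget <= stop:
        budgets.append(budget)
        budget *= 2
    return budgets


def run_trial(respond, budgets=None):
    """Show progressively longer summaries until the participant decides.

    `respond(budget)` returns "Yes", "No", or MORE for the summary built
    at that budget. The trial ends on the first "Yes"/"No", or once the
    largest budget has been shown. Returns the final response and the
    list of budgets that were shown.
    """
    if budgets is None:
        budgets = budget_schedule()
    shown = []
    answer = MORE
    for budget in budgets:
        shown.append(budget)
        answer = respond(budget)
        if answer != MORE:
            break
    return answer, shown


# Simulated participant who can decide only once the budget reaches 1/8.
answer, shown = run_trial(lambda b: "Yes" if b >= Fraction(1, 8) else MORE)
```

With the default arguments the schedule contains five budgets (1/32, 1/16, 1/8, 1/4, 1/2), so a participant sees at most five summaries per question.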
The key intuition here is that if users are given a complete and representative summary, they should be able to answer the questions accurately, as a good summarization model would pick up the key concepts of the document even at shorter length-budgets, without the summary needing to be expanded further. With this in mind, we recorded task completion time and user response for each treatment. All budgeted summaries were constructed beforehand. We invited 22 graduate students to participate in the study. Each participant was shown summaries generated by at most two different methods in random order. No information on the method used was revealed to a participant at any stage. To prevent information retention, each participant was shown a summary generated from the same document only once. Using a balanced, incomplete block design (Aschbacher, 1971), each of the 10 settings (5 methods × 2 datasets) was assigned to 3 subjects. The average accuracy and task completion time recorded for each treatment are shown in Table 4. The accuracy of the no-content setting is 0 for both datasets, indicating that the questions did not contain any hint to the correct answer, whereas the full-content setting shows that, overall, the questions could be answered from the original documents. When using summaries generated by MLS, the participants responded as accurately as in the full-content setting on dataset D1, while being more than twice as fast, outperforming A2 and A3. For dataset D2, participants were more accurate using summaries constructed by MLS than A2. MLS performed better than A3 on one document, comparably on one, and worse on one, with an average accuracy of 0.55.

Conclusion
We have proposed MLS, a supervised approach to construct abstractive summaries at controllable lengths. Following an extract-then-compress paradigm, we develop the Pointer-Magnifier network, a length-aware, encoder-decoder network that constructs length-constrained summaries by shortening or expanding a prototype summary inferred from the document. The key enabler of this network is an array of semantic kernels with clearly defined human-interpretable syntactic/semantic roles in constructing the summary given a budget-length. We train our network on limited training samples from two cross-domain datasets. Experiments show that the summaries constructed by MLS are coherent and reflectively capture the main concepts of the document. Our human evaluation study also suggests the same. In the future, we would like to extend our work to construct task-driven summaries for interactive question answering tasks. Personalizing a summary based on a user's past interaction model is another exciting direction of future work.

Acknowledgement
We would like to thank Professors Micha Elsner, Joel Bloch, Marie-Catherine de Marneffe, and Michael White for valuable discussions and comments. We would also like to thank the reviewers and folks who participated in our study for sharing critical feedback that helped improve our work.