Controlling Length in Abstractive Summarization Using a Convolutional Neural Network

Convolutional neural networks (CNNs) have met great success in abstractive summarization, but they cannot effectively generate summaries of desired lengths. Because generated summaries are used in different scenarios that may have space or length constraints, the ability to control summary length is an important problem in abstractive summarization. In this paper, we propose an approach that constrains the summary length by extending a convolutional sequence to sequence model. The results show that this approach generates high-quality summaries of user-defined length and consistently outperforms the baselines in terms of ROUGE score, length variation and semantic similarity.


Introduction
Great progress has been made recently on abstractive summarization. Many approaches use a sequence-to-sequence model based on RNNs and an attention mechanism, which was originally used for machine translation (Bahdanau et al., 2014). More recently, a convolutional sequence to sequence model was proposed, equipped with Gated Linear Units, residual connections (He et al., 2016) and an attention mechanism. This convolutional model achieves state-of-the-art accuracy on single-sentence abstractive summarization, and it is much faster than previous recurrent models because it can be easily parallelized. Furthermore, unlike recurrent models, the convolutional model has more stable gradients because of its backpropagation path.
Constraining summary length, while largely neglected in the past, is an important aspect of abstractive summarization. For example, given the same input document, we may want a much shorter summary if it is to be displayed on a mobile device or within a fixed-size advertisement slot on a website. Unfortunately, most existing abstractive summarization models are not trained to react to summary length constraints. When a constraint is given at test time, the current practice is (i) to truncate the generated summary after N tokens when the summary must be no longer than N, and (ii) to ignore the EOS (end of summary) token until the first M tokens are generated when the summary must be at least M tokens long. Such crude control of summary length can leave the output summary incomplete or incoherent.
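The two test-time heuristics can be sketched in a few lines. This is a minimal illustration rather than code from any summarization system: the function names, the `</s>` EOS string and the score dictionary are all hypothetical.

```python
EOS = "</s>"  # hypothetical end-of-summary token


def truncate_to_max(tokens, max_len):
    """Heuristic (i): cut the summary after max_len tokens."""
    return tokens[:max_len]


def choose_token_min_len(scores, generated_so_far, min_len):
    """Heuristic (ii): ignore EOS until min_len tokens have been generated.

    scores maps candidate tokens to model scores for the next position
    and is assumed to contain at least one non-EOS token.
    """
    candidates = dict(scores)
    if generated_so_far < min_len:
        candidates.pop(EOS, None)  # EOS may not be chosen yet
    return max(candidates, key=candidates.get)


summary = "david de gea and victor valdes made the most of the rare english sun".split()
cut = truncate_to_max(summary, 10)
print(" ".join(cut))
```

With a limit of 10 tokens the truncated output stops in the middle of a phrase, which is the kind of incomplete summary described above.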
Previous research on controlling the length of abstractive summaries has been scarce. Fan et al., who apply a convolutional sequence to sequence model to multi-sentence summarization, convert length ranges into special markers that are predefined and fixed. These markers are included in the training vocabulary. At training time, the model prepends to the input of the summarizer a marker indicating the length of the reference summary. At test time, the length of the generated summary is likewise controlled by prepending a marker indicating the desired length. Unfortunately, this approach cannot generate summaries of arbitrary lengths. It only generates summaries within predefined length ranges and thus meets length constraints only approximately, as shown in Table 1. The truncation practice described above can be combined with any length control method, but the excess part (red in Table 1) is then cut off, leaving incomplete sentences.
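The marker-based scheme can be illustrated as follows. This is a sketch under stated assumptions: only the first range, (0, 33], is mentioned in this paper; the remaining bucket boundaries and the marker names such as `<len0>` are invented for illustration.

```python
import bisect

# Hypothetical upper bounds of the length buckets. Only (0, 33] comes from
# the text; the other boundaries are made up for this example.
BUCKET_UPPER_BOUNDS = [33, 45, 60, 80]


def length_marker(length):
    """Map a length to a marker token such as '<len0>' (hypothetical name)."""
    # bisect_left returns the first bucket whose upper bound is >= length,
    # so bucket k covers (previous bound, bound k]; longer lengths fall
    # into one final open-ended bucket.
    idx = bisect.bisect_left(BUCKET_UPPER_BOUNDS, length)
    return f"<len{idx}>"


def prepend_marker(source_tokens, desired_length):
    """Prepend the length marker to the summarizer input."""
    return [length_marker(desired_length)] + list(source_tokens)


print(length_marker(10), length_marker(30))
```

Desired lengths of 10 and 30 map to the same marker, so a model conditioned only on the marker cannot tell them apart. This is why such a scheme meets a length constraint only approximately.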
In our work, we extend the convolutional sequence to sequence model by controlling the length of summarization. Our approach seeks to generate summaries of any desired number of tokens (also shown in Table 1). To do this, a length constraint is added to each convolutional block of the initial layer of the model. This information is propagated layer by layer during training.
Table 1: Example summaries generated by different models with a desired length of 10 (red parts exceed the 10 token limit).
Reference summary (53 tokens): david de gea and victor valdes enjoyed an afternoon off at a theme park . spanish duo donned shades as they made the most of the rare sunshine . it has certainly been a rollercoaster season for manchester united . united are third in the premier league after an impressive recent run .
Basic CNN summary (35 tokens): david de gea and victor valdes made the most of the rare english sun with a trip to a theme park . david de gea and victor valdes enjoyed some fun in the sun .
Fan et al. summary (30 tokens): david de gea and victor valdes enjoyed a trip to a theme park . the pair enjoyed a relaxing time just days after united 's win against manchester city .
Our Length Control summary (LC) (10 tokens): david de gea and victor valdes enjoy some fun .
Our contributions are as follows:
1. We propose a simple but effective method to generate summaries with arbitrary desired length (Section 2.2).
2. Our approach substantially outperforms the state-of-the-art baseline methods on all evaluation metrics, i.e., ROUGE scores, length variation and semantic similarity (Section 3).
3. The generated summaries from our model are natural and complete, especially when the desired length is short (Section 3).
Next, we present the basic convolutional sequence to sequence model and our extension, followed by the evaluation of our approach and a discussion of related work.

Methodology
In this section, we will describe the model architecture used for our experiments and propose our length control method which is implemented by extending the basic model.
For summarization problems based on the seq2seq model, given a sequence of tokens $x = (x_1, x_2, \ldots, x_m)$ in the source document and a sequence of tokens $y = (y_1, y_2, \ldots, y_n)$ in the target summary (i.e., m > n), the goal is to estimate the conditional probability p(y|x):
$$p(y \mid x) = \prod_{t=1}^{T} p(y_t \mid y_1, y_2, \ldots, y_{t-1}, x) \qquad (1)$$
where T = n is the number of target tokens. We aim to estimate this conditional probability with a model that can generate summaries of arbitrary desired length.
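In practice, the product in Equation 1 is usually accumulated in log space to avoid numerical underflow on long summaries. The snippet below is a small sketch of that computation; the per-step probabilities are hypothetical values standing in for the outputs of a trained model.

```python
import math


def sequence_log_prob(step_probs):
    """Sum of log p(y_t | y_1..y_{t-1}, x) over all target positions."""
    return sum(math.log(p) for p in step_probs)


# Hypothetical per-step probabilities for a three-token summary.
step_probs = [0.5, 0.4, 0.25]
log_p = sequence_log_prob(step_probs)
prob = math.exp(log_p)  # equals 0.5 * 0.4 * 0.25 up to rounding
print(log_p, prob)
```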

Basic CNN seq2seq Model
Our basic model consists of a multi-layer convolutional sequence to sequence model (CNN seq2seq) (LeCun et al., 1989) and an attention mechanism. Figure 1 illustrates the model. In the CNN seq2seq model, we obtain the input sequence $X = (X_1, \ldots, X_m)$ and output sequence $Y = (Y_1, \ldots, Y_n)$ after combining word vectors with their absolute positions in the document. We use $z^l = (z^l_1, z^l_2, \ldots, z^l_m)$ and $h^l = (h^l_1, h^l_2, \ldots, h^l_n)$ to denote the convolutional output of the encoder and decoder in the l-th layer. Each element of the output sequence generated by the decoder network is fed back into the next layer of the decoder network. Next, we add GLU and residual connections (He et al., 2016) in each layer, where the window $[h^{l-1}_s, \ldots, h^{l-1}_t]$ of previous-layer states corresponds to $h^l_i$ in the convolutional layers. The choice of s and t is based on the kernel width and the padding method used to match the output of the convolutional layers to the input length. We compute the probability distribution of the next element $y_{i+1}$ from the current state by transforming the top decoder output $h^L_i$ via softmax, which gives $p(y_{i+1} \mid y_1, \ldots, y_i, x)$. In addition, a multi-step attention mechanism that connects the encoder and decoder is used in each decoder layer. For attention, we define a decoder state $d^l_i$. The attention $c^l_i$ is a weighted sum of the encoder outputs, with weights $a^l_{ij}$ based on the decoder states.
Finally, we add $c^l_i$ to the current decoder element $h^l_i$, which forms the final output or the input of the next layer in the decoder.
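The data flow at one decoder position (GLU, residual connection, attention, then addition of the context) can be sketched in plain Python. This is a simplified illustration and not the exact formulation used in our experiments: the convolution output is assumed to be given, the decoder state used for attention is taken to be the layer output itself, attention scores are plain dot products, and all function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def glu(vec):
    """Gated linear unit: split vec into halves A and B, return A * sigmoid(B)."""
    half = len(vec) // 2
    return [a * sigmoid(b) for a, b in zip(vec[:half], vec[half:])]


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state, encoder_outputs):
    """Weighted sum of encoder outputs; weights come from dot-product scores."""
    scores = [sum(d * z for d, z in zip(decoder_state, enc)) for enc in encoder_outputs]
    weights = softmax(scores)
    dim = len(encoder_outputs[0])
    context = [sum(w * enc[k] for w, enc in zip(weights, encoder_outputs)) for k in range(dim)]
    return context, weights


def decoder_position(conv_out, prev_state, encoder_outputs):
    """One position of one decoder layer: GLU, residual, attention, addition."""
    h = [g + p for g, p in zip(glu(conv_out), prev_state)]  # GLU + residual
    context, weights = attend(h, encoder_outputs)
    return [hi + ci for hi, ci in zip(h, context)], weights


out, weights = decoder_position(
    conv_out=[1.0, -2.0, 0.0, 0.0],  # assumed output of the convolution (size 2d)
    prev_state=[0.5, 1.0],           # state from the previous layer (size d)
    encoder_outputs=[[1.0, 0.0], [0.0, 1.0]],
)
print(out, weights)
```

The GLU halves the dimensionality of the convolution output, which is why the convolution is assumed to produce vectors of size 2d for a hidden size of d.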

Modified Model with Length Control (LC)
We propose an approach that controls the summary length in the CNN seq2seq model. The model can generate different summaries for different desired lengths, and it is able to generate the EOS tag at the appropriate point in a natural manner.
To produce a summary of a given desired length, we modify the basic model by feeding the desired length as a parameter into the decoder of the CNN seq2seq model. At training time, we use the true length of the gold summary as the desired length. At test time, we can give any desired length len to the model and obtain a summary whose length is approximately len. The modified decoder is shown in Figure 2. The CNN seq2seq model creates a hierarchical structure over the input sequence. It is capable of capturing the correlation between elements over short distances at lower layers and between elements over long distances at higher layers. The useful information among the elements is aggregated after the GLU. Therefore, we set the desired length as an input to the initial state of the decoder (Equation 7), where W is a trainable parameter, len is the desired length, v is the GLU function and $h^0_i$ is the i-th element in the initial layer.
In the above function, we add length information at the first layer of the CNN model. The GLU acts like a gate: it can filter information from a particular unit in each layer. Information is attenuated by the GLU layer by layer, and different desired lengths lead to different degrees of attenuation. The model is therefore able to learn the probability of generating EOS from the attenuation of its own length information. This enables the model to produce a natural and complete summary for a given length constraint.
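To make the idea of a length-conditioned first layer concrete, the snippet below shows one hypothetical way to feed a length into a GLU layer: the desired length is appended to the local input window before a linear map and the gate. It is only an illustration of the mechanism and should not be read as the exact form of Equation 7; the function name, the weight matrix and the input values are all assumptions.

```python
import math


def glu(vec):
    """Gated linear unit: first half of vec gated by the sigmoid of the second half."""
    half = len(vec) // 2
    return [a / (1.0 + math.exp(-b)) for a, b in zip(vec[:half], vec[half:])]


def first_layer_state(window, desired_len, weight):
    """Hypothetical first decoder layer: GLU(W [window; len])."""
    features = list(window) + [float(desired_len)]
    pre_activation = [sum(w * f for w, f in zip(row, features)) for row in weight]
    return glu(pre_activation)


# Toy weights: the first two rows copy the window, the last two rows let
# the desired length drive the gates (all values are made up).
weight = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 0.1],
    [0.0, 0.0, -0.1],
]
window = [0.2, -0.1]
short = first_layer_state(window, 10, weight)
long = first_layer_state(window, 50, weight)
print(short, long)
```

In this toy setting the same input window yields different gated states for different desired lengths, which is the property the higher layers can exploit when deciding when to emit EOS.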

Evaluation
Our benchmark is the CNN/Daily Mail DMQA dataset (Hermann et al., 2015), consisting of pairs of a single source document and a multi-sentence summary. The dataset includes 286,817 training pairs, 13,368 validation pairs and 11,487 test pairs. We follow the same pre-processing step used in prior work and fill in the blanks with answer named entities. We show an example of such pairs in Table 4(a).
We compare our length-constrained summarization model with the basic CNN seq2seq model and the state-of-the-art length-controllable summarization model of Fan et al. Following Fan et al., we distribute the dataset into a set of disjoint buckets that correspond to summaries of different lengths. Each bucket contains a roughly equal number of documents. The distribution is shown in Figure 3.
All competing methods have three flavors: free, truncated and exact. In the free version (Free), given the desired length N, each method generates a summary naturally until an EOS is generated. In the truncated version (Trunc), each method artificially inserts an EOS if no EOS has been generated in the first N tokens. In the exact version (Exact), each method generates N non-EOS tokens by assigning a score of -∞ to the EOS, and inserts an EOS after the N-th token. The purpose of the Free version is to evaluate each method's ability to generate summaries of the desired length; the purpose of the other two versions is to enable a fair comparison of the content of the summaries given that they are of equal length.
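The three flavors differ only in how EOS is handled during decoding. The sketch below uses greedy decoding instead of beam search for brevity; `decode`, `toy_model`, the `</s>` token and the `hard_cap` safety limit are hypothetical names introduced for this example.

```python
EOS = "</s>"  # hypothetical end-of-summary token
NEG_INF = float("-inf")


def decode(next_scores, desired_len, mode, hard_cap=200):
    """Greedy decoding under the Free, Trunc or Exact flavor.

    next_scores(prefix) returns a dict mapping each token to a score.
    """
    tokens = []
    while len(tokens) < hard_cap:
        if mode in ("trunc", "exact") and len(tokens) == desired_len:
            break  # EOS is inserted artificially after N tokens
        scores = dict(next_scores(tokens))
        if mode == "exact":
            scores[EOS] = NEG_INF  # EOS cannot win before N tokens
        best = max(scores, key=scores.get)
        if best == EOS:
            break  # natural end of the summary
        tokens.append(best)
    return tokens


def toy_model(prefix):
    # Hypothetical scorer: prefers EOS once four tokens have been produced.
    eos_score = 1.0 if len(prefix) >= 4 else 0.0
    return {f"w{len(prefix)}": 0.5, EOS: eos_score}


for mode in ("free", "trunc", "exact"):
    print(mode, len(decode(toy_model, 6, mode)))
```

With a toy model that naturally stops after four tokens and a desired length of six, Free and Trunc both return four tokens while Exact is forced to produce six; with a desired length of three, Trunc cuts the output to three tokens.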

Experimental Setup
In the following experiments, all the competing models have 8 convolutional layers in both the encoder and the decoder, with a kernel width of 3.
For each convolutional layer, we set the hidden vector size to 512 and the embedding size to 256.
To alleviate overfitting, we add a dropout layer (p = 0.2) for all convolutional layers and fully connected layers.
To optimize the proposed model, we use Nesterov's accelerated gradient method (Sutskever et al., 2013) with gradient clipping 0.1 (Pascanu et al., 2013), momentum 0.99, and learning rate 0.2. We terminate the training process when the learning rate drops below 10e-5. We set the beam size to 5 for the beam search algorithm at test time. We use the following evaluation metrics in the experiments:
1. ROUGE scores (F1 score) of the produced summaries, including ROUGE-1 (R-1), ROUGE-2 (R-2) and ROUGE-L (R-L) (Lin, 2004). ROUGE-2 is the most popular metric for summarization.
2. Variance (Var) of the summary lengths against the desired length len, where n is the number of pairs in the dataset and $l_i$ is the length of generated summary i. We introduce the variance to evaluate how exactly each model controls the output length.
3. Similarity (Sim) between generated summaries and their corresponding reference summaries, where n is the number of pairs, $y_i$ is the vector representation of reference summary i and $\hat{y}_i$ is the vector of the corresponding generated summary i. Both $y_i$ and $\hat{y}_i$ are the sum of the GloVe word vectors of the words in these summaries.
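The two length and similarity metrics can be sketched as follows. Two assumptions are made here and should be checked against the original definitions: Var is computed as the plain mean squared deviation from the desired length (any scaling constant is omitted), and Sim is computed as the average cosine similarity between summed word vectors. The tiny embedding table is invented and merely stands in for GloVe.

```python
import math


def length_variance(lengths, desired_len):
    """Mean squared deviation of generated lengths from the desired length."""
    return sum((l - desired_len) ** 2 for l in lengths) / len(lengths)


def summary_vector(tokens, embeddings, dim):
    """Sum of word vectors; unknown words contribute nothing."""
    vec = [0.0] * dim
    for tok in tokens:
        for k, value in enumerate(embeddings.get(tok, ())):
            vec[k] += value
    return vec


def cosine(u, v):
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0


def similarity(references, candidates, embeddings, dim):
    """Average cosine similarity over reference/candidate pairs (assumed form)."""
    pairs = list(zip(references, candidates))
    total = sum(
        cosine(summary_vector(r, embeddings, dim), summary_vector(c, embeddings, dim))
        for r, c in pairs
    )
    return total / len(pairs)


# Made-up two-dimensional embeddings standing in for GloVe vectors.
toy_embeddings = {"boat": [1.0, 0.0], "ship": [0.9, 0.1], "sun": [0.0, 1.0]}
print(length_variance([10, 12, 8], 10))
print(similarity([["boat"]], [["ship"]], toy_embeddings, 2))
```

Because the summary vector is a sum of word vectors, two summaries with no words in common can still score highly when their words are close in the embedding space.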
We introduce the similarity metric here to complement the ROUGE scores because Yao et al. (2017a) showed that the standard ROUGE scores cannot capture semantic similarity beyond n-grams. Given the same source document, abstractive summarization may create summaries that do not share many words but mean the same. To show the effectiveness of the Sim metric, we design a dataset from the summarization tasks of TAC 2010-2011. The TAC dataset consists of 90 topics in total, each with 2 subsets. Each subset has 4 reference summaries written by different humans. We assume reference summaries about the same topic to be semantically similar to each other, while summaries across topics are unrelated. We thus created 2,160 pairs of similar summaries as positive data and 2,160 pairs of unrelated summaries as negative data. We then compute the Pearson correlation between the ROUGE score and the ground truth as well as between Sim and the ground truth, and show the results in Table 2. In this experiment, the Sim metric reflects semantic similarity better than ROUGE.
In this paper, we do not use manual evaluation as the major metric, because Lin (2004) showed that manual evaluation is unstable and that inter-human agreement is low due to the variety in abstractive summaries. The ROUGE scores and similarity scores measure syntactic similarity and semantic similarity, respectively. They are complementary to each other and together give a better quantitative assessment of summarization quality.

Experiment 1: Gold Summary Lengths
In the first experiment, for each test document-summary pair, we set the desired length to the length of the gold summary and ask the competing methods to generate a summary of the desired length. As shown in Table 3, the proposed model (LC) outperforms the other models on all of the evaluation metrics. The ROUGE score shows the accuracy of these models. Lower variance reflects better length control. Higher similarity reflects better quality of the generated summaries from the semantic point of view. The LC model achieves the highest ROUGE and similarity scores as well as the lowest variance in both the Free and Exact versions, which shows the effectiveness of LC for generating high-quality summaries under a length constraint. In the Trunc version, the LC model outperforms the other models on all evaluation metrics except for the ROUGE score. Note that the ROUGE scores of the LC model are very stable, indicating its effective length control. The other two models have better ROUGE scores in the Trunc version. However, as the example in Table 4 shows, higher ROUGE scores do not necessarily mean high-quality abstractive summaries. (In Table 4, the entities in different colors indicate two important roles in the text, and the words in bold type mark correct content.)
The ROUGE score consists of Recall (R), Precision (P) and F1-measure (F). A summary tends to achieve a better ROUGE score when it is slightly shorter than the desired length. In Table 4(b), the CNN model has the same R score as the LC model and a higher P score because its summary is slightly shorter. The CNN model thus achieves a higher F score even though its generated summary is not good. Moreover, the basic model always repeats sentences when its generated summary is longer than the desired length. In such cases, as in Table 4(c), truncation improves the P score of the Trunc version by a large margin. Thus, the ROUGE score for the Trunc version is biased toward models with weak length control. The summaries generated by the LC model in Table 4(d), which capture the semantics of the reference summary and satisfy the length constraint very well, are better than those of the other two models even with a slightly lower ROUGE score. The topic of this example is that Louis Jordan, the son of Frank Jordan, got lost while sailing and was finally rescued from his boat. Our model generates a summary with the correct information, whereas the other two models confuse Louis Jordan with Frank Jordan. This is correctly measured by the similarity scores.
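The effect of summary length on precision can be reproduced with a simplified unigram ROUGE. The function below is a sketch based on clipped unigram counts and omits the options of the official ROUGE toolkit; the token lists are made-up examples.

```python
from collections import Counter


def rouge1(candidate, reference):
    """Unigram ROUGE: returns (precision, recall, f1) using clipped counts."""
    overlap = sum((Counter(candidate) & Counter(reference)).values())
    precision = overlap / len(candidate) if candidate else 0.0
    recall = overlap / len(reference) if reference else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


reference = "louis jordan says his sailboat capsized".split()
shorter = "louis jordan says".split()
longer = "louis jordan says he was rescued".split()

print(rouge1(shorter, reference))
print(rouge1(longer, reference))
```

Both candidates match the same three reference words and therefore have the same recall, but the shorter one has higher precision and F1, regardless of which summary is more informative.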

Experiment 2: Arbitrary Lengths
In the second experiment, we ask the methods to generate summaries with arbitrary lengths. We report the results of all three methods with five arbitrary lengths: 10, 30, 50, 70 and 90. We show the performance of each model under the different length constraints in Table 5, Table 6, Figure 4 and Figure 5. The basic CNN model has the same ROUGE scores in the Free version since it cannot control the length of generated summaries on its own. For Fan et al., the desired length is mapped to the model's predefined fixed length range that contains the desired length before it produces its summaries. For example, the desired length 10 is mapped to the first bucket (0, 33]. To demonstrate the effectiveness of the LC model and further illustrate the results, we show an example of summaries generated by the LC (Free) model with different lengths. As shown in Table 7, when the desired length (e.g., 10) is very different from the length of the reference summary, the ROUGE score may not be good even though the generated summary matches the reference quite well semantically. The generated summaries from the LC model are natural and complete. Summaries with a short desired length in the Trunc and Exact versions are more vulnerable to incompleteness. We randomly sample 100 summaries generated by each model under Trunc and Exact with desired lengths of 10 and 30, and manually inspect their readability. This is a simplified human evaluation of summarization, which only determines whether the sentences in summaries under length control are complete. If complete, the score is 1; if not, it is 0. It is easier to carry out and more reliable than more sophisticated human evaluation. Table 8 shows that the LC model has a clear advantage over the other two models in terms of summary fluency.
(a) Source document and reference summary (36 tokens)
Source document: the last time frank jordan spoke with his son, louis jordan was fishing on a sailboat a few miles off the south carolina coast. the next time ... more than two months had passed and the younger jordan was on a container ship 200 miles from north carolina, just rescued from his disabled boat . "i thought i lost you," the relieved father said. louis jordan, 37, took his sailboat out in late january and hadn't been heard from in 66 days ... the younger jordan said he took his sailboat out to the gulf stream to find some better fishing ... the boat capsized two more times before he was rescued, according to jordan.
Reference summary: louis jordan says his sailboat capsized three times . he survived by collecting rainwater and eating raw fish . frank jordan told cnn his son is n't an experienced sailor but has a strong will .
(b) Free summary (29 tokens), Trunc summary (29 tokens)
louis jordan was on a sailboat a few miles off the south carolina coast . he had n't been heard from in 66 days when he was rescued . he was rescued from his boat . (scores: 6.06 6.06 6.06 0.0 0.9293)
Trunc: louis jordan was on a sailboat a few miles off the south carolina coast . he had n't been heard from in 66 days when he was rescued . he was rescued from his boat . (scores: 6.06 6.06 6.06 0.0 0.9293)
Exact: louis jordan was on a sailboat a few miles off the south carolina coast . he had n't been heard from in 66 days when he was rescued . he was rescued from his boat . (scores: 6.06 6.06 6.06 - 0.9293)
Length 50: " i thought i lost you , " jordan says . the younger jordan was on a sailboat a few miles off the south carolina coast . " i thought i lost you , " jordan tells his son . jordan says he was grateful to the people .
In this experiment, the desired length is fixed for all documents and is independent of the lengths of the corresponding reference summaries, so the generated summaries may include more versatile words and phrases that differ from the reference summaries. Thus, the similarity score is more reasonable for evaluation than the ROUGE score. As shown in Table 6 and Figure 4, the LC model achieves the highest similarity score except for lengths 10 and 30 in the Free version. The reason is that only 5% of the test data have a reference summary shorter than 30 tokens. Because the LC model controls length effectively, its generated summaries are usually much shorter than those of the other models, and than the corresponding reference summaries, when we set the desired length to 10 or 30. This leads to the relatively lower similarity scores shown in Figure 4(b) and Figure 4(c). As shown in Figure 5, the LC model achieves the lowest variance. In Figure 5(a), as the length of most summaries is around 50 and the number of summaries with a length of 10 or 90 is small, the CNN model and the Fan model have the lowest variance at 50 and the highest variance at 90. In Figure 5(

Significance Test on Similarity Result
We use a significance test to show that the similarity metric is reliable even though the numerical differences between similarity scores in the experiments are small. Because the similarity scores of generated summaries do not follow a normal distribution, we use the Kruskal-Wallis test (Loukina et al., 2014; Albert, 2017) to measure whether the difference between the similarity results of the three methods is significant. As shown in Table 9, all p-values are less than 0.05; the smaller the p-value, the more significant the difference. Thus, the differences in the similarity results are significant.
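For readers unfamiliar with the test, the following is a compact sketch of the Kruskal-Wallis H statistic for three groups. It uses average ranks for tied values but omits the tie correction factor, and it relies on the chi-square approximation, which for three groups has 2 degrees of freedom and survival function exp(-H/2). The similarity scores are invented for illustration and are not results from our experiments.

```python
import math


def kruskal_wallis(*groups):
    """H statistic and approximate p-value for exactly three groups."""
    if len(groups) != 3:
        raise ValueError("this sketch only handles exactly three groups")
    pooled = sorted((value, g) for g, group in enumerate(groups) for value in group)
    n = len(pooled)
    rank_sums = [0.0] * len(groups)
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        average_rank = (i + 1 + j) / 2.0  # mean of the ranks i+1 .. j
        for _, g in pooled[i:j]:
            rank_sums[g] += average_rank
        i = j
    h = 12.0 / (n * (n + 1)) * sum(
        r * r / len(group) for r, group in zip(rank_sums, groups)
    ) - 3 * (n + 1)
    # Chi-square survival function with 2 degrees of freedom.
    return h, math.exp(-h / 2.0)


# Hypothetical per-summary similarity scores for three methods.
method_a = [0.90, 0.91, 0.92, 0.93, 0.94]
method_b = [0.80, 0.81, 0.82, 0.83, 0.84]
method_c = [0.70, 0.71, 0.72, 0.73, 0.74]
h, p = kruskal_wallis(method_a, method_b, method_c)
print(h, p)
```

Because the test works on ranks rather than raw values, small numerical gaps between methods can still be significant when one method is ranked consistently above the others.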

Related Work
In this section, we discuss some previous work on length control in abstractive summarization and explain why we choose CNN as our basic summarization model.

Length Control for Abstractive Summarization
When summarizing a document, it is desirable to be able to control the length of the summary so as to cater to different users and scenarios. Most abstractive summarization systems are based on encoder-decoder models and generate summaries whose length depends on the training summaries. Because sequence generation models vary in their structures and functions, it is hard to design a length constraint method that applies to all summarization models.
Previous methods control summary length by generating the EOS token at a particular time. One ad-hoc method inhibits the system from generating the EOS tag by assigning a score of -∞ to the tag, and generates a fixed number of words. Kikuchi et al. (2016) proposed two different methods for the RNN seq2seq model, which control the summary length by taking a length embedding as an additional input for the LSTM and by adding the desired length into the initial memory cell of the LSTM. They use Gigaword as the dataset and focus on sentence-level abstractive summarization, which generates one sentence as the summary. For the CNN seq2seq model, Fan et al. put special markers denoting different length ranges into the vocabulary and prepend the marker to the input of the summarizer during training and testing. These special markers are predefined and fixed. In this paper, we aim to generate complete summaries of arbitrary desired length naturally with the CNN seq2seq model. We use a multi-layer CNN seq2seq model in both the encoder and the decoder, and set the length constraint at the first layer of the decoder to implement length control. Compared with other methods, our approach can effectively control the length of the generated summary in a natural manner. It also generates summaries whose length is close to the desired length without losing semantic content, and in less time.

Conclusion
We presented a simple approach that modifies an existing CNN seq2seq model with a summary length input, and were able to train a model that produces fluent and coherent summaries of the desired length. This is a better solution than the current practice of summary truncation. Compared with existing summarization methods, we show that our model can control the output length on its own using its internal state, without losing semantic information or sacrificing the ROUGE score.