Controlling Output Length in Neural Encoder-Decoders

Neural encoder-decoder models have shown great success in many sequence generation tasks. However, previous work has not investigated situations in which we would like to control the length of encoder-decoder outputs. This capability is crucial for applications such as text summarization, in which we have to generate concise summaries with a desired length. In this paper, we propose methods for controlling the output sequence length for neural encoder-decoder models: two decoding-based methods and two learning-based methods. Results show that our learning-based methods have the capability to control length without degrading summary quality in a summarization task.

One of the essential properties that text summarization systems should have is the ability to generate a summary with the desired length. The desired length of a summary strongly depends on the scene of use, such as the granularity of information the user wants to understand or the monitor size of the user's device. The length also depends on the amount of information contained in the given source document. Hence, in the traditional setting of text summarization, both the source document and the desired length of the summary are given as input to a summarization system. However, methods for controlling the output sequence length of encoder-decoder models have not been investigated yet, despite their importance in these settings.
In this paper, we propose and investigate four methods for controlling the output sequence length for neural encoder-decoder models. The first two methods are decoding-based; they receive the desired length during the decoding process, and the training process is the same as for standard encoder-decoder models.
The latter two methods are learning-based; we modify the network architecture to receive the desired length as input.
In experiments, we show that the learning-based methods outperform the decoding-based methods for long (such as 50 or 75 byte) summaries. We also find that despite this additional length-control capability, the proposed methods remain competitive with existing methods on standard settings of the DUC2004 shared task-1.

Related Work
Text summarization is one of the oldest fields of study in natural language processing, and many summarization methods have focused specifically on sentence compression or headline generation. Traditional approaches to this task focus on word deletion using rule-based (Dorr et al., 2003; Zajic et al., 2004) or statistical (Woodsend et al., 2010; Galanis and Androutsopoulos, 2010; Filippova and Strube, 2008; Filippova and Altun, 2013; Filippova et al., 2015) methods. There are also several studies of abstractive sentence summarization using syntactic transduction (Cohn and Lapata, 2008; Napoles et al., 2011) or taking a phrase-based statistical machine translation approach (Banko et al., 2000; Wubben et al., 2012; Cohn and Lapata, 2013).
Recent work has adopted techniques such as encoder-decoder (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Cho et al., 2014) and attentional (Bahdanau et al., 2015; Luong et al., 2015) neural network models from the field of machine translation, and tailored them to the sentence summarization task. Rush et al. (2015) were the first to pose sentence summarization as a new target task for neural sequence-to-sequence learning. Several studies have used this task as one of the benchmarks of their neural sequence transduction methods (Ranzato et al., 2015; Lopyrev, 2015; Ayana et al., 2016). Some studies address other important phenomena that frequently occur in human-written summaries, such as copying from the source document (Gu et al., 2016; Gulcehre et al., 2016). Nallapati et al. (2016) investigate ways to address several important problems, such as capturing keywords or handling inputs of multiple sentences.
Neural encoder-decoders can also be viewed as statistical language models conditioned on the target sentence context. Rosenfeld et al. (2001) have proposed whole-sentence language models that can consider features such as sentence length. However, as described in the introduction, to our knowledge, explicitly controlling the length of output sequences in neural language models or encoder-decoders has not been investigated. Finally, there are some studies that modify the output sequence according to some meta information such as the dialogue act (Wen et al., 2015), user personality (Li et al., 2016b), or politeness (Sennrich et al., 2016). However, these studies have not focused on length, the topic of this paper.

Importance of Controlling Output Length
As we already mentioned in Section 1, the most standard setting in text summarization is to input both the source document and the desired length of the summary to a summarization system. Summarization systems thus must be able to generate summaries of various lengths. Obviously, this property is also essential for summarization methods based on neural encoder-decoder models.
Since an encoder-decoder model is a completely data-driven approach, the output sequence length depends on the training data that the model is trained on. For example, we use sentence-summary pairs extracted from the Annotated English Gigaword corpus as training data (Rush et al., 2015), and the average length of the human-written summaries is 51.38 bytes. Figure 1 shows the statistics of the corpus. When we train a standard encoder-decoder model and perform standard beam search decoding on the corpus, the average length of its output sequences is 38.02 bytes.
However, there are other situations where we want summaries with other lengths. For example, DUC2004 is a shared task where the maximum length of summaries is set to 75 bytes, and summarization systems would benefit from generating sentences up to this length limit.
While recent neural sentence summarization (NSS) models themselves cannot control their output length, Rush et al. (2015) and others following use an ad-hoc method, in which the system is inhibited from generating the end-of-sentence (EOS) tag by assigning a score of −∞ to the tag and generating a fixed number of words,2 and finally the output summaries are truncated to 75 bytes. Ideally, the models should be able to change the output sequence depending on the given output length, and to output the EOS tag at the appropriate time point in a natural manner.

Figure 1: Histograms of first sentence length, headline length, and their ratio in the Annotated English Gigaword corpus. Bracketed values in each subcaption are averages. Subcaption (c): ratio (0.30).
2 According to the published code (https://github.com/facebook/NAMAS), the default number of words is set to 15, which is too long for the DUC2004 setting. The average number of words of human summaries in the evaluation set is 10.43.

Network Architecture: Encoder-Decoder with Attention
In this section, we describe the model architecture used for our experiments: an encoder-decoder consisting of bi-directional RNNs and an attention mechanism. Figure 2 shows the architecture of the model. Suppose that the source sentence is represented as a sequence of words $x = (x_1, x_2, x_3, ..., x_N)$. For a given source sentence, the summarizer generates a shortened version of the input (i.e., $N > M$) as summary sentence $y = (y_1, y_2, y_3, ..., y_M)$. The model estimates the conditional probability $p(y|x)$ using parameters trained on large training data consisting of sentence-summary pairs. Typically, this conditional probability is factorized as the product of conditional probabilities of the next word in the sequence:
$$p(y|x) = \prod_{t=1}^{M} p(y_t | y_{<t}, x),$$
where $y_{<t} = (y_1, y_2, y_3, ..., y_{t-1})$. In the following, we describe how to compute $p(y_t | y_{<t}, x)$.
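As a small illustration of this factorization, the sketch below scores a summary by summing per-step log-probabilities. The function name `sequence_log_prob` and the toy probability values are assumptions made for illustration only; in the model, each value would be the decoder's estimate of p(y_t | y_<t, x).

```python
import math

def sequence_log_prob(step_probs):
    """Return log p(y|x) given the per-step probabilities p(y_t | y_<t, x)."""
    # The product of conditionals becomes a sum in log space,
    # which avoids numerical underflow for long sequences.
    return sum(math.log(p) for p in step_probs)

# Toy example: three decoding steps with hypothetical probabilities.
toy_step_probs = [0.5, 0.25, 0.8]
log_p = sequence_log_prob(toy_step_probs)
```

Working in log space is also what makes beam search scores additive across time steps.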

Encoder
We use a bi-directional RNN (BiRNN) as the encoder, which has been shown to be effective in neural machine translation (Bahdanau et al., 2015) and speech recognition (Schuster and Paliwal, 1997; Graves et al., 2013).
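A BiRNN processes the source sentence in both forward and backward directions with two separate RNNs. During the encoding process, the BiRNN computes both forward hidden states $(\overrightarrow{h}_1, \overrightarrow{h}_2, ..., \overrightarrow{h}_N)$ and backward hidden states $(\overleftarrow{h}_1, \overleftarrow{h}_2, ..., \overleftarrow{h}_N)$ as follows:
$$\overrightarrow{h}_t = g(\overrightarrow{h}_{t-1}, x_t), \qquad \overleftarrow{h}_t = g(\overleftarrow{h}_{t+1}, x_t).$$
While $g$ can be any kind of recurrent unit, we use long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997) networks that have memory cells for both directions ($\overrightarrow{c}_t$ and $\overleftarrow{c}_t$).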
After encoding, we set the initial hidden state $s_0$ and memory cell $m_0$ of the decoder as follows:
$$s_0 = \overleftarrow{h}_1, \qquad m_0 = \overleftarrow{c}_1.$$

Decoder and Attender
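Our decoder is based on an RNN with an LSTM unit $g$, which computes the decoder hidden state $s_t$ at each time step $t$. We also use the attention mechanism developed by Luong et al. (2015), which uses $s_t$ to compute contextual information $d_t$ of time step $t$. We first summarize the forward and backward encoder states by taking their sum $\bar{h}_i = \overrightarrow{h}_i + \overleftarrow{h}_i$, and then calculate the context vector $d_t$ as the weighted sum of these summarized vectors:
$$d_t = \sum_{i=1}^{N} a_{ti} \bar{h}_i,$$
where $a_{ti}$ is the weight at the $t$-th step for $\bar{h}_i$, computed by a softmax operation:
$$a_{ti} = \frac{\exp(s_t \cdot \bar{h}_i)}{\sum_{j=1}^{N} \exp(s_t \cdot \bar{h}_j)}.$$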
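After the context vector $d_t$ is calculated, the model updates the distribution over the next word as follows:
$$\tilde{s}_t = \tanh(W_{hs}[s_t; d_t] + b_{hs}),$$
$$p(y_t | y_{<t}, x) = \mathrm{softmax}(W_{so}\tilde{s}_t + b_{so}).$$
Note that $\tilde{s}_t$ is also provided as input to the LSTM together with $y_t$ for the next step, which is called the input feeding architecture (Luong et al., 2015).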
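The attention step described in this section can be sketched in plain Python as follows. This is a minimal illustration, not the authors' implementation: the function name `attention_context` and the toy vectors are hypothetical, and the dot-product score between the decoder state and each summarized encoder state is treated here as an assumption (it corresponds to the "dot" variant of Luong et al. (2015)).

```python
import math

def attention_context(s_t, h_fwd, h_bwd):
    """Compute attention weights a_t and context vector d_t for one decoder step.

    s_t   : decoder hidden state (list of floats)
    h_fwd : forward encoder states, one vector per source position
    h_bwd : backward encoder states, same shape as h_fwd
    """
    # Summarize each source position by summing the two directions.
    h_bar = [[f + b for f, b in zip(hf, hb)] for hf, hb in zip(h_fwd, h_bwd)]
    # Dot-product score for each source position (assumed scoring function).
    scores = [sum(s * h for s, h in zip(s_t, hb)) for hb in h_bar]
    # Numerically stable softmax over source positions.
    top = max(scores)
    exps = [math.exp(sc - top) for sc in scores]
    total = sum(exps)
    a_t = [e / total for e in exps]
    # Context vector: weighted sum of the summarized encoder states.
    dim = len(h_bar[0])
    d_t = [sum(a * hb[k] for a, hb in zip(a_t, h_bar)) for k in range(dim)]
    return a_t, d_t

# Toy example with two source positions and two-dimensional states.
a_t, d_t = attention_context(
    s_t=[1.0, 0.0],
    h_fwd=[[0.5, 0.0], [1.0, 1.0]],
    h_bwd=[[0.5, 1.0], [1.0, 0.0]],
)
```

In this toy case the second source position receives the larger weight because its summarized state has the larger dot product with the decoder state.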

Training and Decoding
The training objective of our models is to maximize the log likelihood of the sentence-summary pairs in a given training set $D$:
$$\mathcal{L}(\theta) = \sum_{(x, y) \in D} \log p(y | x; \theta).$$
Once models are trained, we use beam search to find the output that maximizes the conditional probability.

Controlling Length in Encoder-decoders
In this section, we propose four methods that can control the length of the output in the encoder-decoder framework. In the first two methods, the decoding process is used to control the output length without changing the model itself. In the other two methods, the model itself is changed and is trained to obtain the capability of controlling the length. Following the evaluation dataset used in our experiments, we use bytes as the unit of length, although our models can use either words or bytes as necessary.

fixLen: Beam Search without EOS Tags
The first method we examine is a decoding approach similar to the one taken in many recent NSS methods, but slightly less ad-hoc. In this method, we inhibit the decoder from generating the EOS tag by assigning it a score of −∞. Since the model cannot stop the decoding process by itself, we simply stop the decoding process when the length of the output sequence reaches the desired length. More specifically, during beam search, when the length of the sequence generated so far exceeds the desired length, the last word is replaced with the EOS tag and the score of the last word is replaced with the score of the EOS tag (EOS replacement).
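The decoding rule above can be sketched as a small beam search. This is a simplified illustration under stated assumptions rather than the authors' code: `step_log_probs` is a hypothetical stand-in for the decoder, length is measured as the UTF-8 byte length of the space-joined words, and `toy_model` uses made-up probabilities.

```python
import math

EOS = "</s>"

def byte_len(words):
    """Length in bytes of the space-joined words (EOS itself is not counted)."""
    return len(" ".join(w for w in words if w != EOS).encode("utf-8"))

def fixlen_beam_search(step_log_probs, desired_bytes, beam_size=2):
    """Beam search that never chooses EOS by itself and stops by length."""
    beam = [((), 0.0)]   # each hypothesis is (words, cumulative log-probability)
    finished = []
    while beam:
        candidates = []
        for words, score in beam:
            log_probs = step_log_probs(words)
            overflowed = False
            for word, lp in log_probs.items():
                if word == EOS:
                    continue  # same effect as giving EOS a score of -inf
                extended = words + (word,)
                if byte_len(extended) > desired_bytes:
                    overflowed = True
                else:
                    candidates.append((extended, score + lp))
            if overflowed:
                # EOS replacement: the overflowing last word is replaced by
                # EOS, and its score by the EOS score.
                finished.append((words + (EOS,), score + log_probs[EOS]))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = candidates[:beam_size]
    return max(finished, key=lambda c: c[1])

def toy_model(prefix):
    # Hypothetical context-independent distribution over a tiny vocabulary.
    return {"ab": math.log(0.6), "cd": math.log(0.3), EOS: math.log(0.1)}

best_words, best_score = fixlen_beam_search(toy_model, desired_bytes=5)
```

With the toy distribution, every hypothesis is forced to end once a third word would exceed 5 bytes, so the returned sequence never exceeds the desired length.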

fixRng: Discarding Out-of-range Sequences
Our second decoding method is based on discarding out-of-range sequences. Unlike fixLen, the decoder is not inhibited from generating the EOS tag, allowing it to decide when to stop generation. Instead, we define the legitimate range of the sequence by setting minimum and maximum lengths. Specifically, in addition to the normal beam search procedure, we set two rules:
• If the model generates the EOS tag when the output sequence is shorter than the minimum length, we discard the sequence from the beam.
• If the generated sequence exceeds the maximum length, we also discard the sequence from the beam. We then replace its last word with the EOS tag and add this sequence to the beam (EOS replacement in Section 4.1).3
In other words, we keep only the sequences that contain the EOS tag and are in the defined length range. This method is a compromise that allows the model some flexibility to plan the generated sequences, but only within a certain acceptable length range.
It should be noted that this method needs a larger beam size if the desired length is very different from the average summary length in the training data, as it will need to preserve hypotheses that have the desired length.
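The two fixRng rules can be sketched as a filter applied to each expanded hypothesis during beam search. The function name `fixrng_filter`, its argument layout, and the byte-counting convention (UTF-8 length of the space-joined words) are assumptions made for this illustration.

```python
EOS = "</s>"

def byte_len(words):
    """Length in bytes of the space-joined words (EOS itself is not counted)."""
    return len(" ".join(w for w in words if w != EOS).encode("utf-8"))

def fixrng_filter(prefix, prefix_score, word, word_lp, eos_lp,
                  min_bytes, max_bytes):
    """Apply the fixRng rules to the hypothesis `prefix + (word,)`.

    Returns the (words, score) pair to keep in the beam, or None to discard.
    `eos_lp` is the log-probability the model gave to EOS at this step.
    """
    if word == EOS:
        if byte_len(prefix) < min_bytes:
            return None  # rule 1: ended before the minimum length
        return prefix + (EOS,), prefix_score + word_lp
    extended = prefix + (word,)
    if byte_len(extended) > max_bytes:
        # Rule 2: the over-long sequence is discarded, and a copy whose last
        # word is replaced by EOS is added instead (EOS replacement).
        return prefix + (EOS,), prefix_score + eos_lp
    return extended, prefix_score + word_lp

# Toy calls with hypothetical scores.
too_short = fixrng_filter(("hi",), -1.0, EOS, -0.5, -0.5, 5, 12)
in_range = fixrng_filter(("hello", "world"), -1.0, EOS, -0.5, -0.5, 5, 12)
too_long = fixrng_filter(("hello", "world"), -1.0, "again", -0.2, -0.7, 5, 12)
```

Because rule 1 removes hypotheses outright, a beam can empty quickly when the desired range is far from the lengths the model prefers, which is consistent with the need for a larger beam noted above.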

LenEmb: Length Embedding as Additional Input for the LSTM
Our third method is a learning-based method specifically trained to control the length of the output sequence. Inspired by previous work that has demonstrated that additional inputs to decoder models can effectively control the characteristics of the output (Wen et al., 2015; Li et al., 2016b), this model provides information about the length in the form of an additional input to the network. Specifically, the model uses an embedding $e_2(l_t) \in \mathbb{R}^D$ for each potential desired length, which is parameterized by a length embedding matrix $W_{le} \in \mathbb{R}^{D \times L}$, where $L$ is the number of length types. In the decoding process, we input the embedding of the remaining length $l_t$ as additional input to the LSTM (Figure 3). $l_t$ is initialized after the encoding process and updated during the decoding process as follows:
$$l_1 = \mathit{length},$$
$$l_{t+1} = \begin{cases} 0 & (l_t - \mathrm{byte}(y_t) \le 0) \\ l_t - \mathrm{byte}(y_t) & (\text{otherwise}), \end{cases}$$
where $\mathrm{byte}(y_t)$ is the length of output word $y_t$ and $\mathit{length}$ is the desired length. We learn the values of the length embedding matrix $W_{le}$ during training. This method provides additional information about the amount of length remaining in the output sequence, allowing the decoder to "plan" its output based on the remaining number of words it can generate.
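The bookkeeping of the remaining length can be sketched as follows. The function name `remaining_lengths` and the example words are hypothetical, and the sketch assumes that byte(y_t) counts only the UTF-8 bytes of the word itself (whether separating spaces are counted is not stated in the text).

```python
def remaining_lengths(words, desired_length):
    """Return the remaining lengths l_1, l_2, ... in bytes for a word sequence."""
    l_t = desired_length          # the remaining length starts at the desired length
    trace = [l_t]
    for word in words:
        # Subtract the byte length of the word just generated, clipping at 0.
        l_t = max(0, l_t - len(word.encode("utf-8")))
        trace.append(l_t)
    return trace

# Toy example: a desired length of 15 bytes.
trace = remaining_lengths(["police", "arrest", "suspect"], 15)
```

Each value in the trace would select one length embedding that is fed to the LSTM at the corresponding step; once the value reaches 0 it stays there.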
3 This is a workaround to prevent the situation in which all sequences are discarded from a beam.

LenInit: Length-based Memory Cell Initialization
While LenEmb inputs the remaining length $l_t$ to the decoder at each step of the decoding process, the LenInit method inputs the desired length once at the initial state of the decoder. Figure 4 shows the architecture of LenInit. Specifically, the model uses the memory cell $m_t$ to control the output length by initializing the states of the decoder (hidden state $s_0$ and memory cell $m_0$) as follows:
$$s_0 = \overleftarrow{h}_1, \qquad m_0 = b_c \cdot \mathit{length},$$
where $b_c \in \mathbb{R}^H$ is a trainable parameter and $\mathit{length}$ is the desired length. While the LenEmb model is guided towards the appropriate output length by inputting the remaining length at each step, LenInit attempts to provide the model with the ability to manage the output length on its own using its inner state. Specifically, the memory cell of LSTM networks is suitable for this endeavour, as it is possible for LSTMs to learn functions that, for example, subtract a fixed amount from a particular memory cell every time they output a word. Although other ways of managing the length are also possible,4 we found this approach to be both simple and effective.
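The idea can be illustrated with a toy sketch. It assumes that the initial memory cell is obtained by scaling the trainable vector b_c by the desired length, as the description of b_c suggests; the values of `b_c` below are made up, and `steps_until_empty` is only an idealized picture of the "subtract a fixed amount per word" behaviour that an LSTM could learn, not something the model is guaranteed to implement.

```python
def init_memory_cell(b_c, desired_length):
    """Scale the trainable vector b_c element-wise by the desired length."""
    return [b * desired_length for b in b_c]

def steps_until_empty(cell_value, decrement):
    """Count how many fixed decrements it takes for one cell to reach zero."""
    steps = 0
    while cell_value > 0:
        cell_value -= decrement
        steps += 1
    return steps

# Hypothetical learned values; in the model b_c is trained with the network.
b_c = [0.01, -0.02, 0.0]
m_0 = init_memory_cell(b_c, desired_length=50)

# Idealized countdown: a cell holding 50 that loses 5 per generated word.
n_words = steps_until_empty(50, 5)
```

A larger desired length gives a proportionally larger initial cell value, so a countdown of this kind would run for proportionally more steps.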

Dataset
We trained our models on a part of the Annotated English Gigaword corpus (Napoles et al., 2012), which Rush et al. (2015) constructed for sentence summarization. We perform preprocessing using the standard script for the dataset.5 The dataset consists of approximately 3.6 million pairs of the first sentence from each source document and its headline. Figure 1 shows the length histograms of the summaries in the training set. The vocabulary size is 116,875 for the source documents and 67,564 for the target summaries, including the beginning-of-sentence, end-of-sentence, and unknown word tags.
For LenEmb and LenInit, we input the length of each headline during training. Note that we do not train a separate summarization model for each headline length, but rather a single model that is capable of controlling the length of its output.
We evaluate the methods on the evaluation set of DUC2004 task-1 (generating very short single-document summaries). In this task, summarization systems are required to create a very short summary for each given document. Summaries over the length limit (75 bytes) will be truncated and there is no bonus for creating a shorter summary. The evaluation set consists of 500 source documents and 4 human-written (reference) summaries for each source document. Figure 5 shows the length histograms of the summaries in the evaluation set. Note that the human-written summaries are not always as long as 75 bytes. We used three variants of ROUGE (Lin, 2004) as evaluation metrics: ROUGE-1 (unigram), ROUGE-2 (bigram), and ROUGE-L (longest common subsequence). The two-sided permutation test (Chinchor, 1992) was used for statistical significance testing (p ≤ 0.05).
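The 75-byte limit can be illustrated with a short truncation helper. This is only a sketch: the function name `truncate_to_bytes` is hypothetical, and the exact truncation procedure of the official evaluation is not described here, so the handling of multi-byte characters below is an assumption.

```python
def truncate_to_bytes(summary, limit=75):
    """Cut a summary to at most `limit` bytes without splitting a character."""
    encoded = summary.encode("utf-8")
    if len(encoded) <= limit:
        return summary
    # errors="ignore" drops a trailing partial multi-byte character, if any.
    return encoded[:limit].decode("utf-8", errors="ignore")

# Toy examples.
short_kept = truncate_to_bytes("police arrest suspect")
long_cut = truncate_to_bytes("a" * 100)
accented_cut = truncate_to_bytes("é" * 40)  # 80 bytes before truncation
```

Since there is no bonus for shorter summaries, a system that stops well before the limit simply leaves part of its budget unused.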

Implementation
We use Adam (Kingma and Ba, 2015) ($\alpha = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, eps $= 10^{-8}$) to optimize parameters with a mini-batch size of 80. Before every 10,000 updates, we first sample 800,000 training examples, make groups of 80 examples with the same source sentence length, and shuffle the 10,000 groups.
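The batching procedure can be sketched as follows (800,000 sampled examples in groups of 80 give the 10,000 groups mentioned above). The function name `make_batches` and the toy data are hypothetical, and the handling of leftover examples that do not fill a complete group is an assumption, since the text does not say what happens to them.

```python
import random
from collections import defaultdict

def make_batches(examples, sample_size, batch_size, rng):
    """Sample examples, group them by source length, and shuffle the groups.

    `examples` is a list of (source_tokens, summary_tokens) pairs.
    """
    sampled = rng.sample(examples, sample_size)
    by_length = defaultdict(list)
    for example in sampled:
        by_length[len(example[0])].append(example)
    batches = []
    for same_length in by_length.values():
        # Split each length bucket into mini-batches; the last one in a
        # bucket may be smaller than batch_size (assumed behaviour).
        for i in range(0, len(same_length), batch_size):
            batches.append(same_length[i:i + batch_size])
    rng.shuffle(batches)
    return batches

# Toy data: 40 examples whose source lengths are 3, 4, or 5 tokens.
rng = random.Random(0)
toy = [(["w"] * (3 + i % 3), ["s"]) for i in range(40)]
batches = make_batches(toy, sample_size=30, batch_size=4, rng=rng)
```

Grouping by source length means that every mini-batch can be processed without padding the source side.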
We set the dimension of word embeddings to 100 and that of the hidden state to 200. For LSTMs, we initialize the bias of the forget gate to 1.0 and use 0.0 for the other gate biases (Józefowicz et al., 2015). We use Chainer (Tokui et al., 2015) to implement our models. For LenEmb, we set L to 300, which is larger than the longest summary lengths in our dataset (see Figure 1-(b) and Figure 5-(b)).
For all methods except fixRng, we found a beam size of 10 to be sufficient, but for fixRng we used a beam size of 30 because it more aggressively discards candidate sequences from its beams during decoding.

ROUGE Evaluation
Table 1 shows the ROUGE scores of each method with various length limits (30, 50, and 75 bytes). Regardless of the length limit set for the summarization methods, we use the same reference summaries. Note that fixLen and fixRng generate summaries under a hard length constraint imposed by their decoding process. Hence, when we calculate the scores of LenEmb and LenInit, we impose a hard constraint on length to make the comparison fair (i.e., LenEmb(0,L) and LenInit(0,L) in the table). Specifically, we use the same beam search as that for fixRng with a minimum length of 0.
For the purpose of showing the length control capability of LenEmb and LenInit, we show in the bottom two lines the results of standard beam search without the hard constraints on length.6 We will use the results of LenEmb(0,∞) and LenInit(0,∞) in the discussions in Sections 6.2 and 6.3.
The results show that the learning-based methods (LenEmb and LenInit) tend to outperform decoding-based methods (fixLen and fixRng) for the longer summaries of 50 and 75 bytes. However, in the 30-byte setting, there is no significant difference between these two types of methods. We hypothesize that this is because the average compression rate in the training data is 30% (Figure 1-(c)), while the 30-byte setting forces the model to generate summaries with an average compression rate of 15.38%, and thus the learning-based models did not have enough training data to learn compression at such a steep rate.
6 fixRng is equivalent to standard beam search when we set the range as (0, ∞).

Examples of Generated Summaries
Tables 2 and 3 show examples from the validation set of the Annotated Gigaword Corpus. The tables show that all models, including both learning-based methods and decoding-based methods, can often generate well-formed sentences.
We can see various paraphrases of "#### us figure championships"7 and "withdrew". Some examples are generated as a single noun phrase (LenEmb(30) and LenInit(30)), which may be suitable for the short length setting.

Length Control Capability of Learning-based Models
Figure 6 shows histograms of output length from the standard encoder-decoder, LenEmb, and LenInit.
While the output lengths from the standard model disperse widely, the lengths from our learning-based models are concentrated around the desired length. These histograms clearly show the length-controlling capability of our learning-based models. Table 4-(a) shows the final state of the beam when LenInit generates a sentence with a length of 30 bytes for the example with standard beam search in Table 3. We can see that all the sentences in the beam are generated with lengths close to the desired length. This shows that our method has obtained the ability to control the output length as expected. For comparison, Table 4-(b) shows the final state of the beam if we perform standard beam search in the standard encoder-decoder model (used in fixLen and fixRng). Although each sentence is well-formed, their lengths are much more varied.

Comparison with Existing Methods
Finally, we compare our methods to existing methods on standard settings of the DUC2004 shared task-1. Although the objective of this paper is not to obtain state-of-the-art scores on this evaluation set, it is of interest whether our length-controllable models are competitive on this task. Table 5 shows the scores of our methods, which are copied from Table 1, in addition to the scores of some existing methods. ABS (Rush et al., 2015) is the most standard model of neural sentence summarization and is the most similar method to our baseline setting (fixLen). This table shows that the score of fixLen is comparable to those of the existing methods. The table also shows that LenEmb and LenInit have the capability of controlling the length without decreasing the ROUGE score.

Conclusion
In this paper, we presented the first examination of the problem of controlling length in neural encoder-decoder models, from the point of view of summarization. We examined methods for controlling the length of output sequences: two decoding-based methods (fixLen and fixRng) and two learning-based methods (LenEmb and LenInit). The results showed that learning-based methods generally outperform the decoding-based methods, and the learning-based methods obtained the capability of controlling the output length without losing ROUGE score compared to existing summarization methods.

Method                            ROUGE-1  ROUGE-2  ROUGE-L
ABS (Rush et al., 2015)             26.55     7.06    22.05
ABS+ (Rush et al., 2015)            28.18     8.49    23.81
RAS-Elman (Chopra et al., 2016)     28.97     8.26    24.06
RAS-LSTM (Chopra et al., 2016)      27.41     7.69    23.06

Figure 5: Histograms of first sentence length, summary length, and their ratio in DUC2004.

Table 1: ROUGE scores with various length limits. The scores with * are significantly worse than the best score in the column (bolded).
14.34 3.10* 13.23 20.00* 5.98 18.26* 25.87*

Table 2: Examples of the output of each method with various specified lengths.

Table 3: More examples of the output of each method.

Table 4: Final state of the beam when the learning-based model is instructed to output a 30 byte summary for the source document in Table 3.

Figure 6: Histograms of output lengths generated by (a) the standard encoder-decoder, (b) LenEmb, and (c) LenInit. For LenEmb and LenInit, the bracketed numbers in each region are the desired lengths we set.

Table 5: Comparison with existing studies for DUC2004. Note that the top four rows are reproduced from Table 1.

opportunity to use the Kurisu server of Dwango Co., Ltd. for our experiments.