Cutting-off Redundant Repeating Generations for Neural Abstractive Summarization

This paper tackles the reduction of redundant repeating generation that is often observed in RNN-based encoder-decoder models. Our basic idea is to jointly estimate the upper-bound frequency of each target vocabulary word in the encoder and to control the output words in the decoder based on the estimation. Our method shows significant improvement over a strong RNN-based encoder-decoder baseline and achieves the current best scores on most evaluations of an abstractive summarization benchmark.


Introduction
The RNN-based encoder-decoder (EncDec) approach has recently driven significant progress in various natural language generation (NLG) tasks, e.g., machine translation (MT) (Sutskever et al., 2014) and abstractive summarization (ABS) (Rush et al., 2015). Since a scheme in this approach can be interpreted as a conditional language model, it is suitable for NLG tasks. However, one potential weakness is that it sometimes repeatedly generates the same phrase (or word). This issue has been discussed in the neural MT (NMT) literature as part of the coverage problem (Tu et al., 2016; Mi et al., 2016). Such repeating generation behavior can become more severe in some NLG tasks than in MT. The very short ABS task in DUC-2003 and 2004 (Over et al., 2007) is a typical example because it requires the generation of a summary within a pre-defined, limited output space, such as ten words or 75 bytes. Thus, repeated output consumes precious output space. Unfortunately, the coverage approach cannot be directly applied to ABS tasks. ABS requires us to find the salient ideas in the input in a lossy-compression manner, so the summary (output) length hardly depends on the input length, whereas an MT task is mainly loss-less generation with a nearly one-to-one correspondence between input and output (Nallapati et al., 2016a).
Against this background, this paper tackles this issue and proposes a method to overcome it in ABS tasks. The basic idea of our method is to jointly estimate, during the encoding process, the upper-bound frequency of each target vocabulary word that can occur in a summary, and to exploit the estimation to control the output words in each decoding step. We refer to our additional component as a word-frequency estimation (WFE) sub-model. The WFE sub-model explicitly manages how many times each word has been generated so far and how many more times it might be generated during the rest of the decoding process. Thus, we expect it to decisively prohibit excessive generation. Finally, we evaluate the effectiveness of our method on the well-studied ABS benchmark data provided by Rush et al. (2015) and also evaluated in (Chopra et al., 2016; Nallapati et al., 2016b; Kikuchi et al., 2016; Takase et al., 2016; Ayana et al., 2016; Gulcehre et al., 2016).

Baseline RNN-based EncDec Model
The baseline of our proposal is an RNN-based EncDec model with an attention mechanism (Luong et al., 2015). This model has already been used as a strong baseline for ABS tasks (Chopra et al., 2016; Kikuchi et al., 2016) as well as in the NMT literature. More specifically, as a case study we employ a 2-layer bidirectional LSTM encoder and a 2-layer LSTM decoder with global attention. We omit a detailed review of the model due to space limitations. The following are the parts necessary for explaining our proposed method.
Let $X = (x_i)_{i=1}^{I}$ and $Y = (y_j)_{j=1}^{J}$ be the input and output sequences, respectively, where $x_i$ and $y_j$ are one-hot vectors that correspond to the $i$-th word in the input and the $j$-th word in the output. Let $V^t$ denote the vocabulary (set of words) of the output. For simplification, this paper uses the following notation rules: (1) $(x_i)_{i=1}^{I}$ is a short notation for a list of (column) vectors, i.e., $(x_1, \ldots, x_I)$; (2) $v(a, D)$ represents a $D$-dimensional (column) vector whose elements are all $a$, e.g., $v(1, 3) = (1, 1, 1)^\top$.
Encoder: Let $\Omega^s(\cdot)$ denote the overall process of our 2-layer bidirectional LSTM encoder. The encoder receives input $X$ and returns a list of final hidden states $H^s = (h^s_i)_{i=1}^{I}$:
$$H^s = \Omega^s(X). \tag{1}$$
Decoder: We employ a $K$-best beam-search decoder to find the (approximated) best output $\hat{Y}$ given input $X$. Figure 1 shows a typical $K$-best beam search algorithm used in the decoder of the EncDec approach. We define the (minimal) required information $h$ shown in Figure 1 for the $j$-th decoding process as the triplet $h = (s_{j-1}, \hat{Y}_{j-1}, H^t_{j-1})$, where $s_{j-1}$ is the cumulative log-likelihood from step 0 to $j-1$, $\hat{Y}_{j-1} = (y_0, \ldots, y_{j-1})$ is a (candidate) output word sequence generated so far from step 0 to $j-1$, and $H^t_{j-1}$ is the set of all hidden states needed for calculating the $j$-th decoding process. The function calcLL in Line 8 computes the likelihood vector $\tilde{o}_j$ given in Eq. 2, where Softmax(·) is the softmax function for a given vector and $\Omega^t(\cdot)$ represents the overall process of a single decoding step. Moreover, $\tilde{O}$ in Line 11 is an $(M \times (K - C))$-matrix, where $C$ is the number of complete sentences in $Q_c$. The $(m, k)$-element of $\tilde{O}$ represents the likelihood of the $m$-th word, namely $\tilde{o}_j[m]$, calculated using the $k$-th candidate in $Q_w$ at the $(j-1)$-th step. In Line 12, the function makeTriplet constructs a set of triplets based on the index information $(m, k)$.
Then, in Line 13, the function selectTopK selects the top-$K$ candidates from the union of the set of triplets generated at the current step, $\{h_z\}_{z=1}^{K-C}$, and the set of triplets of complete sentences in $Q_c$. Finally, the function sepComp in Line 13 divides the set of triplets $Q$ into two distinct sets according to whether they are complete sentences ($Q_c$) or not ($Q_w$). If the elements in $Q$ are all complete sentences, namely $Q_c = Q$ and $Q_w = \emptyset$, then the algorithm stops according to the evaluation in Line 15.
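The candidate expansion described above can be illustrated with a minimal sketch. It assumes that calcLL adds the cumulative log-likelihood of a candidate to the log-softmax of the decoder output for the next word; this form, and the helper names `log_softmax`, `calc_ll`, `select_top_k`, and `expand`, are illustrative assumptions rather than the authors' implementation. The handling of complete sentences ($Q_c$) and of the decoder hidden states is omitted for brevity.

```python
import math
from typing import List, Sequence, Tuple

Candidate = Tuple[float, Tuple[int, ...]]  # (cumulative log-likelihood, word ids)


def log_softmax(o: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax of a score vector."""
    peak = max(o)
    log_z = peak + math.log(sum(math.exp(v - peak) for v in o))
    return [v - log_z for v in o]


def calc_ll(s_prev: float, o: Sequence[float]) -> List[float]:
    """Assumed calcLL: cumulative score plus the log-probability of each word."""
    return [s_prev + lp for lp in log_softmax(o)]


def select_top_k(scored: List[Candidate], k: int) -> List[Candidate]:
    """Keep the k candidates with the highest cumulative log-likelihood."""
    return sorted(scored, key=lambda c: c[0], reverse=True)[:k]


def expand(beam: List[Candidate],
           decoder_scores: List[Sequence[float]],
           k: int) -> List[Candidate]:
    """One decoding step: extend every working candidate by every word."""
    scored: List[Candidate] = []
    for (s_prev, seq), o in zip(beam, decoder_scores):
        for m, s in enumerate(calc_ll(s_prev, o)):
            scored.append((s, seq + (m,)))
    return select_top_k(scored, k)


# Hypothetical 3-word vocabulary, one working candidate, beam size 2.
top = expand([(0.0, (0,))], [[2.0, 1.0, 0.0]], k=2)
print([seq for _, seq in top])  # [(0, 0), (0, 1)]
```

Because every candidate carries its own cumulative score, candidates of different histories can be ranked against each other in a single top-K selection.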

Word Frequency Estimation
This section describes our proposed method, which roughly consists of two parts: (1) a sub-model that estimates the upper-bound frequencies of the target vocabulary words in the output, and (2) an architecture for controlling the output words in the decoder using the estimations.

Definition
Let $\hat{a}$ denote a vector representation of the frequency estimation, and let $\odot$ denote the element-wise product. $\hat{a}$ is calculated by
$$\hat{a} = \hat{r} \odot \hat{g}, \qquad \hat{r} = \mathrm{ReLU}(r), \qquad \hat{g} = \mathrm{Sigmoid}(g), \tag{3}$$
where Sigmoid(·) and ReLU(·) represent the element-wise sigmoid and ReLU (Glorot et al., 2011), respectively. Thus, every element of $\hat{r}$ is non-negative and every element of $\hat{g}$ lies between 0 and 1. We incorporate two separate components, $\hat{r}$ and $\hat{g}$, to improve the frequency fitting. The purpose of $\hat{g}$ is to distinguish whether the target words occur or not, regardless of their frequency. Thus, $\hat{g}$ can be interpreted as a gate function that resembles the fertility estimation in coverage (Tu et al., 2016) and the switch probability in the copy mechanism (Gulcehre et al., 2016). These ideas originated from gated recurrent networks such as LSTM (Hochreiter and Schmidhuber, 1997) and GRU (Chung et al., 2014). As a result, $\hat{r}$ can focus on modeling frequencies equal to or larger than 1. This separation can be expected to work since $\hat{r}[m]$ has no influence if $\hat{g}[m] = 0$.
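The interplay of the two components can be seen in a small sketch. It assumes the estimate is the element-wise product of ReLU applied to a raw frequency score `r` and a sigmoid applied to a raw occurrence score `g`; the function name `wfe_estimate` and the example values are hypothetical.

```python
import math
from typing import List, Sequence


def relu(x: float) -> float:
    """Element-wise ReLU."""
    return max(0.0, x)


def sigmoid(x: float) -> float:
    """Element-wise sigmoid, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def wfe_estimate(r: Sequence[float], g: Sequence[float]) -> List[float]:
    """Assumed form of the estimate: ReLU(r) * Sigmoid(g), per target word."""
    return [relu(ri) * sigmoid(gi) for ri, gi in zip(r, g)]


# Hypothetical scores for a 3-word vocabulary.
a_hat = wfe_estimate(r=[2.0, -1.0, 3.0], g=[10.0, 5.0, -10.0])
# Word 0: open gate, so the estimate is close to 2.
# Word 1: negative frequency score, so the estimate is exactly 0.
# Word 2: closed gate, so the estimate is close to 0 despite r = 3.
print(a_hat)
```

The third word shows why the gate is useful: a large frequency score has almost no effect once the occurrence gate is closed.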

Effective usage
The technical challenge of our method is effectively leveraging the WFE $\hat{a}$. Among several possible choices, we chose to integrate it as prior knowledge in the decoder. To do so, we re-define $\tilde{o}_j$ in Eq. 2 as Eq. 4. The difference is the additional term $\tilde{a}_j$, which is an adjusted likelihood for the $j$-th step originally calculated from $\hat{a}$. The definition of $\tilde{a}_j$ uses ClipReLU$_1$(·), a function that receives a vector $x$ and performs the element-wise calculation $x'[m] = \max(0, \min(1, x[m]))$ for all $m$. We define the relation between $\tilde{r}_j$ in Eq. 4 and $\hat{r}$ in Eq. 3 as follows:
$$\tilde{r}_j = \begin{cases} \hat{r} & \text{if } j = 1 \\ \tilde{r}_{j-1} - \hat{y}_{j-1} & \text{otherwise.} \end{cases} \tag{5}$$
Eq. 5 updates $\tilde{r}_{j-1}$ to $\tilde{r}_j$ with the estimated output of the previous step, $\hat{y}_{j-1}$. Since $\hat{y}_j \in \{0, 1\}^M$ for all $j$, all of the elements in $\tilde{r}_j$ are monotonically non-increasing in $j$. Once $\tilde{r}_{j'}[m] \le 0$ at some step $j'$, the $m$-th word will never be selected any more at any step $j \ge j'$. Thus, the interpretation of $\tilde{r}_j$ is that it directly manages the upper-bound frequency of each target word that can occur in the current and future decoding time steps. As a result, decoding with our method never generates words beyond the estimation $\hat{r}$, and thus we expect to reduce the redundant repeating generation. Note here that our method never requires $\tilde{r}_j[m] \le 0$ (or $\tilde{r}_j[m] = 0$) for all $m$ at the last decoding time step $j$, as is generally required in coverage (Tu et al., 2016; Mi et al., 2016). This is why we say upper-bound frequency estimation, not just (exact) frequency.

Figure 2: Procedure for calculating the components of our WFE sub-model (input: the list of hidden states generated by the encoder; output: (g, r)).

Figure 2 shows the detailed procedure for calculating $g$ and $r$ in Eq. 3. For $r$, we sum up all of the features of the input given by the encoder (Line 2) and estimate the frequency. In contrast, for $g$, we expect Lines 5 and 6 to work as a kind of voting in both the positive and negative directions, since $g$ needs just occurrence information, not frequency. For example, $g$ may take large positive or negative values if a certain input word (feature) strongly influences whether specific target word(s) occur in the output. This idea is borrowed from the max-pooling layer (Goodfellow et al., 2013).
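The effect of the remaining-frequency budget on decoding can be illustrated with a minimal sketch. It makes several assumptions that should be read as such: the adjusted likelihood is taken to be the logarithm of the clipped remaining budget times the gate (so an exhausted budget yields minus infinity), greedy search replaces the K-best beam search to keep the example short, and all function names and scores are hypothetical.

```python
import math
from typing import List, Sequence

NEG_INF = float("-inf")


def clip_relu1(x: float) -> float:
    """ClipReLU_1: clamp a value to the interval [0, 1]."""
    return max(0.0, min(1.0, x))


def adjusted_likelihood(r_rem: Sequence[float],
                        g_hat: Sequence[float]) -> List[float]:
    """Assumed adjustment: log(ClipReLU_1(r_rem) * g_hat), with log(0) = -inf."""
    out = []
    for r, g in zip(r_rem, g_hat):
        p = clip_relu1(r) * g
        out.append(math.log(p) if p > 0.0 else NEG_INF)
    return out


def greedy_decode(step_scores: Sequence[Sequence[float]],
                  r_hat: Sequence[float],
                  g_hat: Sequence[float]) -> List[int]:
    """Greedy decoding under the budget (stand-in for beam search)."""
    r_rem = list(r_hat)  # budget at the first step is the estimate itself
    output: List[int] = []
    for o in step_scores:
        a = adjusted_likelihood(r_rem, g_hat)
        adjusted = [oi + ai for oi, ai in zip(o, a)]
        # Softmax is monotone, so the argmax can be taken before normalizing.
        m = max(range(len(adjusted)), key=adjusted.__getitem__)
        output.append(m)
        # Subtract the one-hot vector of the emitted word from the budget.
        r_rem = [r - (1.0 if i == m else 0.0) for i, r in enumerate(r_rem)]
    return output


# The decoder always prefers word 0; the budget decides how often it may appear.
scores = [[5.0, 1.0, 0.0]] * 3
print(greedy_decode(scores, r_hat=[1.0, 1.0, 1.0], g_hat=[1.0, 1.0, 1.0]))  # [0, 1, 2]
print(greedy_decode(scores, r_hat=[2.0, 1.0, 1.0], g_hat=[1.0, 1.0, 1.0]))  # [0, 0, 1]
```

Without the budget, the same scores would produce word 0 at every step, which is exactly the repeating behavior the method targets. The budget only caps how often a word may appear; nothing forces every budget to be used up.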

Parameter estimation (Training)
Given the training data, let $a^* \in \mathbb{P}^M$ be a vector representation of the true frequency of the target words given the input, where $\mathbb{P} = \{0, 1, \ldots, +\infty\}$. Clearly, $a^*$ can be obtained by counting the words in the corresponding output. We define the loss function $\Psi_{\mathrm{wfe}}$ for estimating our WFE sub-model in Eq. 6, where $W$ represents the overall parameters. The form of $\Psi_{\mathrm{wfe}}(\cdot)$ is closely related to that used in support vector regression (SVR) (Smola and Schölkopf, 2004). As in SVR, the loss leaves a penalty-free band around each true frequency; since all the elements of $a^*$ are integers, the remaining 0.25 on both the positive and negative sides serves as the margin between adjacent integers. We select $b = 2$ to penalize more distant errors more heavily, and $c_1 < c_2$, specifically $c_1 = 0.2$ and $c_2 = 1$, since we aim to obtain an upper-bound estimation and to penalize underestimation below the true frequency $a^*$. Finally, we minimize Eq. 6 together with the standard negative log-likelihood objective function used to estimate the baseline EncDec model.
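The asymmetry between overestimation and underestimation can be made concrete with a small sketch of an SVR-style loss. The exact form used here is an assumption: a penalty-free band of half-width `eps` around each true count, a power `b` on the excess error, weight `c1` on overestimation and weight `c2` on underestimation, summed over the vocabulary. The value `eps = 0.25` is inferred from the margin discussion above, and the function name `wfe_loss` is hypothetical.

```python
from typing import Sequence


def wfe_loss(a_hat: Sequence[float],
             a_true: Sequence[int],
             eps: float = 0.25,
             b: int = 2,
             c1: float = 0.2,
             c2: float = 1.0) -> float:
    """Assumed asymmetric epsilon-insensitive loss, summed over target words."""
    total = 0.0
    for est, true in zip(a_hat, a_true):
        over = max(0.0, est - true - eps)   # estimate above the band
        under = max(0.0, true - est - eps)  # estimate below the band
        total += c1 * over ** b + c2 * under ** b
    return total


# Inside the band: no penalty.
print(wfe_loss([1.2], [1]))  # 0.0
# The same absolute error of 1 costs less above the true count than below it.
print(wfe_loss([2.0], [1]))  # about 0.1125
print(wfe_loss([0.0], [1]))  # 0.5625
```

With `c1 < c2`, the estimator is pushed to err on the high side, which matches the goal of an upper-bound estimate.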

Experiments
We investigated the effectiveness of our method on ABS experiments, which were first performed by Rush et al. (2015). The data consist of approximately 3.8 million training, 400,000 validation, and 400,000 test instances [2]. Generally, 1951 test instances, randomly extracted from the test data section, are used for evaluation [3]. Additionally, the DUC-2004 evaluation data (Over et al., 2007) [4] were evaluated with the identical models trained on the above Gigaword data. We strictly followed the instructions of the evaluation setting used in previous studies for a fair comparison. Table 1 summarizes the model configuration and the parameter estimation setting in our experiments. Table 2 shows the results of the baseline EncDec and our proposed EncDec+WFE. Note that the DUC-2004 data was evaluated by recall-based ROUGE scores, while the Gigaword data was evaluated by F-score-based ROUGE. To confirm the validity of our EncDec baseline, we also ran the OpenNMT tool [5]. The results on Gigaword data with B = 5 were 33.65, 16.12, and 31.37 for ROUGE-1(F), ROUGE-2(F), and ROUGE-L(F), respectively, which were similar to (but slightly lower than) the results of our implementation. This supports that our baseline worked well as a strong baseline. EncDec+WFE clearly outperformed the strong EncDec baseline by a wide margin on the ROUGE scores. Thus, we conclude that the WFE sub-model has a positive impact on ABS performance, since the performance gains were derived only from incorporating our WFE sub-model. Table 3 lists the current top system results. Our method, EncDec+WFE, achieved the current best scores on most evaluations. This result also supports the effectiveness of incorporating our WFE sub-model.

[2] The data can be created by the data construction scripts in the authors' code: https://github.com/facebook/NAMAS.
[3] As previously described (Chopra et al., 2016), we removed the ill-formed (empty) data for Gigaword.
[4] http://duc.nist.gov/duc2004/tasks.html

Figure 3: Generation examples.
G: china success at youth world championship shows preparation for #### olympics
A: china germany germany germany germany and germany at world youth championship
B: china faces germany at world youth championship

G: British and Spanish governments leave extradition of Pinochet to courts
A: spain britain seek shelter from pinochet 's pinochet case over pinochet 's
B: spain britain seek shelter over pinochet 's possible extradition from spain

G: torn UNK : plum island juniper duo now just a lone tree
A: black women black women black in black code
B: in plum island of the ancient

Comparison to current top systems
MRT (Ayana et al., 2016) previously provided the best results. Note that its model structure is nearly identical to our baseline. In contrast to our approach, MRT trained a model with sequence-wise minimum risk estimation, while we trained all the models in our experiments with standard (point-wise) log-likelihood maximization. MRT essentially complements our method. We expect to further improve the performance of our method by applying MRT for its training, since recent progress in NMT has suggested leveraging a sequence-wise optimization technique for improving performance (Wiseman and ...). We leave this as our future work. Figure 3 shows actual generation examples. Based on our motivation, we specifically selected the redundant repeating output that occurred in the baseline EncDec. It is clear that EncDec+WFE successfully reduced the repetitions. This observation offers further qualitative evidence of the effectiveness of our method.

[5] http://opennmt.net

Performance of the WFE sub-model
To evaluate the WFE sub-model alone, Table 4 shows the confusion matrix of the frequency estimation. We quantized $\hat{a}$ by $\lfloor \hat{a}[m] + 0.5 \rfloor$ for all $m$, where 0.5 was derived from the margin in $\Psi_{\mathrm{wfe}}$. Unfortunately, the result does not look very good, and there seems to be considerable room to improve the estimation. However, we emphasize that the sub-model is already powerful enough to improve the overall quality, as shown in Table 2 and Figure 3. We can expect further gains in overall performance by improving the performance of the WFE sub-model.
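The quantization step and the resulting confusion counts can be sketched as follows. The sketch assumes that quantization means rounding to the nearest integer by flooring the estimate plus 0.5; the function names and the example values are hypothetical and are not taken from Table 4.

```python
import math
from collections import Counter
from typing import Sequence, Tuple


def quantize(a_hat: Sequence[float]) -> list:
    """Round each estimate to the nearest integer via floor(a + 0.5)."""
    return [math.floor(a + 0.5) for a in a_hat]


def confusion(a_hat: Sequence[float],
              a_true: Sequence[int]) -> Counter:
    """Count (true frequency, quantized estimate) pairs over target words."""
    pairs: Sequence[Tuple[int, int]] = list(zip(a_true, quantize(a_hat)))
    return Counter(pairs)


# Hypothetical estimates for four target words.
print(quantize([0.2, 0.5, 1.49, 2.6]))  # [0, 1, 1, 3]
print(confusion([0.2, 0.5, 1.49, 2.6], [0, 1, 2, 3]))
```

Each key of the counter is one cell of the confusion matrix, so off-diagonal keys such as `(2, 1)` mark words whose frequency was underestimated.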

Conclusion
This paper discussed the behavior of redundant repeating generation often observed in neural EncDec approaches. We proposed a method for reducing such redundancy by incorporating a sub-model that directly estimates and manages the frequency of each target vocabulary word in the output. Experiments on ABS benchmark data showed the effectiveness of our method, EncDec+WFE, for both improving automatic evaluation performance and reducing the actual redundancy. Our method is suitable for lossy compression tasks such as image caption generation.