Global Encoding for Abstractive Summarization

In neural abstractive summarization, the conventional sequence-to-sequence (seq2seq) model often suffers from repetition and semantic irrelevance. To tackle the problem, we propose a global encoding framework, which controls the information flow from the encoder to the decoder based on the global information of the source context. It consists of a convolutional gated unit that performs global encoding to improve the representations of the source-side information. Evaluations on LCSTS and English Gigaword both demonstrate that our model outperforms the baseline models, and the analysis shows that our model is capable of generating summaries of higher quality and reducing repetition.


Introduction
Abstractive summarization can be regarded as a sequence mapping task in which the source text is mapped to the target summary. Therefore, sequence-to-sequence learning, whose model consists of an encoder and a decoder, can be applied to neural abstractive summarization (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Cho et al., 2014). The attention mechanism has been broadly used in seq2seq models, where the decoder extracts information from the encoder based on the attention scores on the source-side information (Bahdanau et al., 2014; Luong et al., 2015). Many attention-based seq2seq models have been proposed for abstractive summarization (Rush et al., 2015; Chopra et al., 2016; Nallapati et al., 2016), and they outperformed the conventional statistical methods.
Text: the mainstream fatah movement on monday officially chose mahmoud abbas, chairman of the palestine liberation organization (plo), as its candidate to run for the presidential election due on jan.#, ####, the official wafa news agency reported.
seq2seq: fatah officially officially elects abbas as candidate for candidate .
Gold: fatah officially elects abbas as candidate for presidential election

Table 1: An example of the summary of the conventional attention-based seq2seq model on the Gigaword dataset. The highlighted text indicates repetition; "#" refers to a masked number.
However, recent studies show that there are salient problems in the attention mechanism. Zhou et al. (2017) pointed out that there is no obvious alignment relationship between the source text and the target summary, and that the encoder outputs contain noise for the attention. For example, in the summary generated by the seq2seq model in Table 1, "officially" is followed by the same word, as the attention mechanism still attends to the word with a high attention score. Attention-based seq2seq models for abstractive summarization can suffer from repetition and semantic irrelevance, causing grammatical errors and insufficient reflection of the main idea of the source text.
To tackle this problem, we propose a model of global encoding for abstractive summarization. We set a convolutional gated unit to perform global encoding on the source context. The gate, which is based on a convolutional neural network (CNN), filters each encoder output according to the global context, which it can capture thanks to parameter sharing, so that the representations at each time step are refined with consideration of the global context. We conduct experiments on LCSTS and Gigaword, two benchmark datasets for sentence summarization, which show that our model outperforms the state-of-the-art methods with ROUGE-2 F1 scores of 26.8 and 17.8 respectively. Moreover, the analysis shows that our model is capable of reducing repetition compared with the seq2seq model.

Figure 1: Structure of our proposed Convolutional Gated Unit. We implement 1-dimensional convolution with a structure similar to Inception (Szegedy et al., 2015) over the outputs of the RNN encoder, where k refers to the kernel size.

Global Encoding
Our model is based on the seq2seq model with attention. For the encoder, we set a convolutional gated unit for global encoding. Based on the outputs from the RNN encoder, the global encoding refines the representation of the source context with a CNN to improve the connection of the word representation with the global context. In the following, the techniques are introduced in detail.

Attention-based seq2seq
The RNN encoder receives the word embedding of each word from the source text sequentially. The final hidden state, which carries the information of the whole source text, becomes the initial hidden state of the decoder. Here our encoder is a bidirectional LSTM encoder, where the encoder outputs from both directions at each time step are concatenated. We implement a unidirectional LSTM decoder to read the input words and generate the summary word by word, with a fixed target vocabulary embedded in a high-dimensional space Y ∈ R^{|Y|×dim}. At each time step, the decoder generates a summary word y_t by sampling from a distribution over the target vocabulary P_vocab until it samples the token representing the end of sentence. The decoder hidden state s_t and the encoder output h_i at each time step i of the encoding process are combined through a weight matrix W_a to obtain the global attention α_{t,i} and the context vector c_t. In the corresponding equations, C refers to the cell state in the LSTM, and g(·) refers to a non-linear function.
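The attention step described above can be illustrated with a minimal sketch in plain Python. The bilinear score s_t^T W_a h_i is an assumption (one common way to combine s_t, h_i and W_a), and the function names `softmax` and `attention` are hypothetical, not taken from the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(s_t, encoder_outputs, W_a):
    """Return (alpha_t, c_t) for one decoder step.

    s_t: decoder hidden state, list of floats.
    encoder_outputs: list of encoder outputs h_i, each a list of floats.
    W_a: weight matrix as a list of rows (assumed bilinear scoring).
    """
    scores = []
    for h_i in encoder_outputs:
        # W_a h_i, then dot product with s_t
        projected = [sum(w * x for w, x in zip(row, h_i)) for row in W_a]
        scores.append(sum(s * p for s, p in zip(s_t, projected)))
    alpha = softmax(scores)  # attention weights alpha_{t,i}
    dim = len(encoder_outputs[0])
    # context vector c_t = sum_i alpha_{t,i} * h_i
    context = [sum(a * h[d] for a, h in zip(alpha, encoder_outputs))
               for d in range(dim)]
    return alpha, context
```

With an identity W_a, the encoder output most similar to s_t receives the largest weight, and the context vector is a convex combination of the encoder outputs.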

Convolutional Gated Unit
Abstractive summarization requires the core information at each encoding time step. To reach this goal, we implement a gated unit on top of the encoder outputs at each time step, which is a CNN that convolves all the encoder outputs. The parameter sharing of the convolutional kernels enables the model to extract certain types of features, specifically n-gram features. Similar to images, language also contains local correlation, such as the internal correlation of phrase structure. The convolutional units can extract these common features in the sentence and indicate the correlation among the source annotations. Moreover, to further strengthen the global information, we implement self-attention (Vaswani et al., 2017) to mine the relationship of the annotation at a certain time step with other annotations. Therefore, the gated unit is able to find both common n-gram features and global correlation. Based on the convolution and self-attention, the gated unit sets a gate to filter the source annotations from the RNN encoder, in order to select information relevant to the global semantic meaning. The global encoding allows the encoder output at each time step to become a new representation vector with a further connection to the global source-side information.
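To make the n-gram intuition concrete, the following is a minimal sketch of a 1-dimensional convolution with a ReLU activation over a sequence of encoder outputs. The zero padding, the odd kernel size, and the function name `conv1d_ngram` are assumptions made for illustration; the same filter weights are reused at every position, which is the parameter sharing mentioned above.

```python
def relu(x):
    """Rectified Linear Unit."""
    return max(0.0, x)


def conv1d_ngram(H, filters, biases, k):
    """Apply shared filters of width k to every window of k encoder outputs.

    H: list of n vectors of dimension d (the encoder outputs).
    filters: list of filters, each a list of k weight vectors of dimension d.
    biases: one bias per filter.
    k: kernel size, assumed odd so the output keeps length n with zero padding.
    """
    n, d = len(H), len(H[0])
    pad = k // 2
    zero = [0.0] * d
    padded = [zero] * pad + H + [zero] * pad
    output = []
    for i in range(n):
        window = padded[i:i + k]  # the k-gram centred on position i
        features = []
        for f, b in zip(filters, biases):
            total = b
            for j in range(k):
                total += sum(w * x for w, x in zip(f[j], window[j]))
            features.append(relu(total))
        output.append(features)
    return output
```

For example, a single all-ones filter with k = 3 over the one-dimensional sequence 1, 2, 3 sums each trigram window, including the zero padding at both ends.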
For convolution, we implement a structure similar to Inception (Szegedy et al., 2015). We use 1-dimensional convolution to extract n-gram features. Following the design principle of Inception, we did not use a kernel with k = 5 but instead used two kernels with k = 3 to avoid a large kernel size. The convolution block uses ReLU, the Rectified Linear Unit (Nair and Hinton, 2010), as its non-linear activation function. Based on the convolution block, we implement a structure similar to Inception, as shown in Figure 1.

On top of the new representations generated by the CNN module, we further implement self-attention upon these representations so as to dig out the global correlations. Vaswani et al. (2017) pointed out that self-attention encourages the model to learn long-term dependencies and does not create much computational complexity, so we implement its scaled dot-product attention for the connection between the annotation at each time step and the global information:

Attention(Q, K, V) = softmax(QK^T / √d_k) V

where the representations are computed through the attention mechanism with themselves and packed into a matrix, and d_k is the dimension of the keys. To be specific, Q and V refer to the representation matrix generated by the CNN module, while K = W_att V, where W_att is a learnable matrix.

A further step is to set a gate, based on the output g of the CNN and self-attention module, for the source representations h from the RNN encoder:

h̃ = h ⊙ σ(g)

Since the CNN module can extract n-gram features of the whole source text and self-attention learns the long-term dependencies among the components of the input source text, the gate can perform global encoding on the encoder outputs. Based on the output of the CNN and self-attention, the logistic sigmoid function σ outputs a vector with a value between 0 and 1 at each dimension. If the value is close to 0, the gate removes most of the information at the corresponding dimension of the source representation, and if it is close to 1, it retains most of the information.
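The self-attention and gating steps can be sketched as follows in plain Python. Q and V are both set to the CNN output and K = W_att V, as stated in the text; the element-wise form h ⊙ σ(g) for the gate is an assumption consistent with the description of the sigmoid, and all function names are hypothetical.

```python
import math


def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def transpose(A):
    return [list(row) for row in zip(*A)]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(V, W_att):
    """Scaled dot-product attention with Q = V and K = W_att V.

    V: n x d matrix, one row per time step (the CNN output).
    W_att: d x d learnable matrix.
    """
    d = len(V[0])
    K = matmul(V, transpose(W_att))   # row i is W_att applied to v_i
    scores = matmul(V, transpose(K))  # Q K^T with Q = V
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, V)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gate(H, G):
    """Filter encoder outputs H element-wise with sigmoid(G) (assumed form)."""
    return [[h * sigmoid(g) for h, g in zip(h_row, g_row)]
            for h_row, g_row in zip(H, G)]
```

A gate input near zero halves the corresponding dimension, a large positive input passes it through almost unchanged, and a large negative input suppresses it almost entirely.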

Training
Given the parameters θ and source text x, the model generates a summary ỹ. The learning process is to minimize the negative log-likelihood between the generated summary ỹ and the reference y. Minimizing this loss is equivalent to maximizing the conditional probability of the summary y given the parameters θ and the source sequence x.
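As a minimal sketch of this objective, the negative log-likelihood of one reference summary can be computed from the per-step output distributions. The function name and the representation of distributions as lists of probabilities indexed by word id are assumptions for illustration; any normalization over summary length or batch is omitted.

```python
import math


def negative_log_likelihood(step_distributions, reference_ids):
    """Sum of -log P(y_t | previous words, x, theta) over the reference.

    step_distributions: one probability list per decoding step,
        indexed by vocabulary id.
    reference_ids: the reference summary as a list of vocabulary ids.
    """
    return -sum(math.log(dist[y])
                for dist, y in zip(step_distributions, reference_ids))
```

Because the logarithm is monotonic, lowering this sum raises the product of the per-step probabilities, which is the conditional probability of the whole reference summary.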

Experiment Setup
In the following, we introduce the datasets that we conduct experiments on and our experiment settings as well as the baseline models that we compare with.

Datasets
LCSTS is a large-scale Chinese short text summarization dataset collected from Sina Weibo, a famous Chinese social media website (Hu et al., 2015), consisting of more than 2.4 million text-summary pairs. The original texts are shorter than 140 Chinese characters, and the summaries are created manually. We follow the previous research (Hu et al., 2015) to split the dataset for training, validation and testing, with 2.4M sentence pairs for training, 8K for validation and 0.7K for testing.
The English Gigaword is a sentence summarization dataset based on Annotated Gigaword (Napoles et al., 2012). It consists of sentence pairs, each made up of the first sentence of a collected news article and the corresponding headline. We use the data preprocessed by Rush et al. (2015) with 3.8M sentence pairs for training, 8K for validation and 2K for testing.

Experiment Settings
We implement our experiments in PyTorch on an NVIDIA 1080Ti GPU. The word embedding dimension and the number of hidden units are both 512. In both experiments, the batch size is set to 64. We use the Adam optimizer (Kingma and Ba, 2014) with the default setting α = 0.001, β_1 = 0.9, β_2 = 0.999 and ε = 1 × 10^−8. The learning rate is halved every epoch. Gradient clipping is applied with range [-10, 10].
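The learning rate schedule and the gradient clipping can be sketched in a few lines of plain Python. The assumption that the first epoch runs at the base rate and that clipping is applied element-wise to gradient values is ours, and the function names are hypothetical.

```python
def learning_rate(epoch, base_lr=0.001):
    """Learning rate for a zero-indexed epoch, halved after each epoch.

    Assumes epoch 0 uses the base rate alpha = 0.001.
    """
    return base_lr * (0.5 ** epoch)


def clip_gradients(grads, bound=10.0):
    """Clamp every gradient value into the range [-bound, bound]."""
    return [max(-bound, min(bound, g)) for g in grads]
```

Under these assumptions the rate after three halvings is one eighth of the base rate, and only gradient values outside [-10, 10] are changed.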
Following previous studies, we choose the ROUGE score (Lin and Hovy, 2003) to evaluate the performance of our model. ROUGE calculates the degree of overlap between the generated summary and the reference, including the number of overlapping n-grams. F1 scores of ROUGE-1, ROUGE-2 and ROUGE-L are used as the evaluation metrics.
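For illustration, a simplified ROUGE-N F1 can be computed from clipped n-gram overlap counts. This is a sketch under the assumption of whitespace-tokenized input with no stemming or other preprocessing; it is not the official ROUGE toolkit, and the function names are hypothetical.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(candidate, reference, n):
    """F1 of clipped n-gram overlap between candidate and reference tokens."""
    cand = ngram_counts(candidate, n)
    ref = ngram_counts(reference, n)
    overlap = sum((cand & ref).values())  # each n-gram counted at most min(c, r) times
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# The seq2seq output and the gold summary from Table 1.
seq2seq = "fatah officially officially elects abbas as candidate for candidate .".split()
gold = "fatah officially elects abbas as candidate for presidential election".split()
rouge_1 = rouge_n_f1(seq2seq, gold, 1)
```

In this example 7 of the 10 candidate unigrams match the 9 reference unigrams after clipping, so the repeated words lower precision without adding recall.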

Baseline Models
As we compare our results with the results of the baseline models reported in their original papers, the evaluation on the two datasets has different baselines. In the following, we introduce the baselines for LCSTS and Gigaword respectively. For LCSTS, RNN and RNN-context are the RNN-based seq2seq models (Hu et al., 2015), without and with the attention mechanism respectively. CopyNet is the attention-based seq2seq model with the copy mechanism (Gu et al., 2016). SRB is a model that improves semantic relevance between source text and summary (Ma et al., 2017). DRGD is the conventional seq2seq with a deep recurrent generative decoder (Li et al., 2017).
As to the baselines for Gigaword, ABS and ABS+ are the models with local attention and handcrafted features (Rush et al., 2015). Feats is a fully RNN seq2seq model with some specific methods to control the vocabulary size. RAS-LSTM and RAS-Elman are seq2seq models with a convolutional encoder, paired with an LSTM decoder and an Elman RNN decoder respectively. SEASS is a seq2seq model with a selective gate mechanism. DRGD is also a baseline for Gigaword.
Results of our implementation of the conventional seq2seq model on both datasets are also used for the evaluation of the improvement of our proposed convolutional gated unit (CGU).

Analysis
In the following sections, we report the results of our experiments and analyze the performance of our model on the evaluation of repetition. We also provide an example to demonstrate that our model can generate a summary that is more semantically consistent with the source text.

Results
In the experiments on the two datasets, our model achieves higher ROUGE scores than the baselines, and the advantage on LCSTS is significant. Table 2 presents the results of our model and the baselines on LCSTS, and Table 3 shows the results of the models on Gigaword. We compare the F1 scores of our model with those of the baseline models (reported in their original articles) and our own implementation of the attention-based seq2seq. Compared with the conventional seq2seq model, our model improves the ROUGE-2 score by 3.7 and 1.5 on LCSTS and Gigaword respectively.

Discussion
We show a summary generated by our model in Table 4, compared with that of the baseline seq2seq model and the reference. The source text describes how Starbucks, an ordinary coffee brand in the United States, has become a high-class brand and sells coffee at a much higher price. It is apparent that the main idea of the text is the high price of Starbucks coffee in China. However, the seq2seq model generates a summary which only contains the information of the brand and the country. In addition, it redundantly repeats the word "China". Its summary is not semantically relevant to the source text, and it is neither coherent nor adequate. In comparison, the summary of our model is more coherent and more semantically relevant to the source text. Our model focuses on the information about price instead of country, and points out the price gap in its generated summary. As "China" appears twice in the source text, it is hard for the baseline model to put it in a less significant place. Our model with CGU, in contrast, is able to filter the trivial details that are irrelevant to the core meaning of the source text and focus on the information that contributes most to the main idea.

As our CGU is responsible for selecting important information from the outputs of the RNN encoder to improve the quality of the attention score, it should be able to reduce repetition in the generated summary. We evaluate the degree of repetition by calculating the percentage of duplicates at the sentence level. The evaluations on Gigaword for duplicates of 1-grams to 4-grams (Figure 2) show that our model significantly reduces repetition compared to the conventional seq2seq, and that its repetition rate is similar to that of the reference. This also shows that our model is able to generate summaries of higher diversity with less repetition.
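One possible way to measure sentence-level duplicates is sketched below. The exact definition used for the evaluation is not given in the text, so this sketch assumes that every occurrence of an n-gram after its first occurrence in the same sentence counts as a duplicate; the function name is hypothetical.

```python
from collections import Counter


def duplicate_percentage(tokens, n):
    """Percentage of n-grams in a sentence that repeat an earlier n-gram.

    Assumed definition: each n-gram occurrence beyond the first counts
    as one duplicate.
    """
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not grams:
        return 0.0
    counts = Counter(grams)
    duplicates = sum(c - 1 for c in counts.values())
    return 100.0 * duplicates / len(grams)


# The seq2seq output and the gold summary from Table 1.
seq2seq = "fatah officially officially elects abbas as candidate for candidate .".split()
gold = "fatah officially elects abbas as candidate for presidential election".split()
```

Under this definition, the seq2seq output from Table 1 has two duplicated unigrams among its ten tokens, while the gold summary has none.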

Related Work
Researchers developed many statistical methods and linguistic-rule-based methods to study automatic summarization (Banko et al., 2000; Dorr et al., 2003; Zajic et al., 2004; Cohn and Lapata, 2008). With the development of neural networks in NLP, a growing body of research has appeared in abstractive summarization, since it seems possible that neural networks can help achieve the two goals. Rush et al. (2015) were the first to apply the sequence-to-sequence model with the attention mechanism to abstractive summarization and achieved significant improvements. Chopra et al. (2016) modified the ABS model with an RNN decoder, and Nallapati et al. (2016) changed the system to a fully-RNN sequence-to-sequence model and achieved outstanding performance. Zhou et al. (2017) proposed a selective gate mechanism to filter secondary information. Li et al. (2017) proposed a deep recurrent generative decoder to learn latent structure information. Ma et al. (2018) proposed a model that generates words by querying word embeddings.

Conclusion
In this paper, we propose a new model for abstractive summarization. The convolutional gated unit performs global encoding on the source-side information so that the core information can be retained and the secondary information can be filtered. Experiments on LCSTS and Gigaword show that our model outperforms the baselines, and the analysis shows that, compared with the conventional seq2seq model, it is able to reduce repetition in the generated summaries and is more robust to inputs of different lengths.

Figure 2: Percentage of the duplicates at sentence level, evaluated on Gigaword.

Table 2: F-Score of ROUGE on LCSTS.

Table 3: F-Score of ROUGE on Gigaword.

Table 4: An example of our summarization, compared with that of the seq2seq model and the reference.