Learning to Encode Text as Human-Readable Summaries using Generative Adversarial Networks

Auto-encoders compress input data into a latent-space representation and reconstruct the original data from the representation. This latent representation is not easily interpreted by humans. In this paper, we propose training an auto-encoder that encodes input text into human-readable sentences, and unpaired abstractive summarization is thereby achieved. The auto-encoder is composed of a generator and a reconstructor. The generator encodes the input text into a shorter word sequence, and the reconstructor recovers the generator input from the generator output. To make the generator output human-readable, a discriminator restricts the output of the generator to resemble human-written sentences. By taking the generator output as the summary of the input text, abstractive summarization is achieved without document-summary pairs as training data. Promising results are shown on both English and Chinese corpora.


Introduction
When it comes to learning data representations, a popular approach is the auto-encoder architecture, which compresses the data into a latent representation without supervision. In this paper we focus on learning text representations. Because text is a sequence of words, a sequence-to-sequence (seq2seq) auto-encoder (Li et al., 2015; Kiros et al., 2015) is usually used: one RNN encodes the input sequence into a fixed-length representation, and another RNN decodes the original input sequence from this representation.
Although the latent representation learned by the seq2seq auto-encoder can be used in downstream applications, it is usually not human-readable. A human-readable representation should comply with the rules of human grammar and be comprehensible to humans. Therefore, in this work, we use comprehensible natural language as a latent representation of the input source text in an auto-encoder architecture. This human-readable latent representation is shorter than the source text; in order to reconstruct the source text, it must reflect the core idea of the source text. Intuitively, the latent representation can be considered a summary of the text, so unpaired abstractive summarization is thereby achieved.
The idea of using human-comprehensible language as a latent representation has been explored for text summarization, but only in a semi-supervised scenario. Previous work (Miao and Blunsom, 2016) uses a prior distribution from a pre-trained language model to constrain the generated sequence to natural language. However, to teach the compressor network to generate text summaries, that model is trained using labeled data. In contrast, in this work we need no labeled data to learn the representation.
As shown in Fig. 1, the proposed model is composed of three components: a generator, a discriminator, and a reconstructor. Together, the generator and reconstructor form a text auto-encoder. The generator acts as an encoder that generates the latent representation from the input text. Instead of using a vector as the latent representation, however, the generator generates a word sequence much shorter than the input text. From the shorter text, the reconstructor reconstructs the original input of the generator. By minimizing the reconstruction loss, the generator learns to generate short text segments that contain the main information in the original input. We use the seq2seq model for both the generator and the reconstructor because both have input and output sequences of different lengths. However, it is quite possible that the generator's output word sequence can be processed and recognized only by the reconstructor and is not readable by humans. Here, instead of regularizing the generator output with a pre-trained language model (Miao and Blunsom, 2016), we borrow from adversarial auto-encoders (Makhzani et al., 2015) and cycle GAN (Zhu et al., 2017) and introduce a third component, the discriminator, to regularize the generator's output word sequence. The discriminator and the generator form a generative adversarial network (GAN) (Goodfellow et al., 2014). The discriminator discriminates between the generator output and human-written sentences, and the generator produces output as similar as possible to human-written sentences to confuse the discriminator. With the GAN framework, the discriminator teaches the generator how to create human-like summary sentences as a latent representation. However, because sampling from discrete distributions is non-differentiable, generating discrete sequences with a GAN is challenging. To tackle this problem, we propose a new method for language generation with GANs.
By achieving unpaired abstractive text summarization, a machine is able to extract the core idea of documents without supervision. This approach has many potential applications. For example, the output of the generator can be used for downstream tasks such as document classification and sentiment classification. In this study, we evaluate the results on an abstractive text summarization task. The output word sequence of the generator is regarded as the summary of the input text. The model is learned from a set of documents without summaries. As most documents are not paired with summaries (for example, movie reviews or lecture recordings), this technique makes it possible to learn a summarizer that generates summaries for these documents. The results show that the generator generates summaries of reasonable quality on both English and Chinese corpora.

Related Work
Abstractive Text Summarization
Recent model architectures for abstractive text summarization mostly use the sequence-to-sequence (Sutskever et al., 2014) framework in combination with various novel mechanisms. One popular mechanism is attention (Bahdanau et al., 2015), which has been shown to be helpful for summarization (Nallapati et al., 2016; Rush et al., 2015). It is also possible to directly optimize evaluation metrics such as ROUGE (Lin, 2004) with reinforcement learning (Ranzato et al., 2016; Paulus et al., 2017; Bahdanau et al., 2016). The hybrid pointer-generator network (See et al., 2017) selects words from the original text with a pointer (Vinyals et al., 2015) or from the whole vocabulary with a trained weight. In order to eliminate repetition, a coverage vector (Tu et al., 2016) can be used to keep track of attended words, and coverage loss (See et al., 2017) can be used to encourage the model to focus on diverse words. While most papers focus on supervised learning with novel mechanisms, in this paper we explore unsupervised training.

Figure 1: Proposed model. Given long text, the generator produces a shorter text as a summary. The generator is learned by minimizing the reconstruction loss together with the reconstructor and by making the discriminator regard its output as human-written text.

GAN for Language Generation
In this paper, we borrow the idea of GAN to make the generator output human-readable. The major challenge in applying GAN to sentence generation is the discrete nature of natural language. To generate a word sequence, the generator usually has non-differentiable parts such as argmax or other sampling functions, which cause the original GAN to fail.
In (Gulrajani et al., 2017), instead of feeding a discrete word sequence, the authors directly feed the generator output layer to the discriminator. This method works because they use the earth mover's distance of the Wasserstein GAN, which is able to evaluate the distance between a discrete and a continuous distribution. SeqGAN (Yu et al., 2017) tackles the sequence generation problem with reinforcement learning. Here, we refer to this approach as adversarial REINFORCE. However, the discriminator only measures the quality of the whole sequence, and thus the rewards are extremely sparse and every generation step is assigned the same reward. MC search (Yu et al., 2017) is proposed to evaluate the approximate reward at each time step, but this method suffers from high time complexity. Following this idea, (Li et al., 2017) proposes a partial evaluation approach to evaluate the expected reward at each time step. In this paper, we propose the self-critical adversarial REINFORCE algorithm as another way to evaluate the expected reward at each time step. The performance of the original WGAN and the proposed adversarial REINFORCE is compared in the experiments.

Proposed Method
The overview of the proposed model is shown in Fig. 2. The model is composed of three components: generator G, discriminator D, and reconstructor R. Both G and R are seq2seq hybrid pointer-generator networks (See et al., 2017), which can decide to copy words from the encoder input text via pointing or generate them from the vocabulary. They both take a word sequence as input and output a sequence of word distributions. Discriminator D, on the other hand, takes a sequence as input and outputs a scalar. The model is learned from a set of documents $x$ and human-written sentences $y^{real}$.
To train the model, a training document $x = \{x_1, x_2, ..., x_t, ..., x_T\}$, where each $x_t$ represents a word, is fed to G, which outputs a sequence of word distributions $G(x) = \{y_1, y_2, ..., y_n, ..., y_N\}$, where $y_n$ is a distribution over all words in the lexicon. Then we randomly sample a word $y^s_n$ from each distribution $y_n$, and a word sequence $y^s = \{y^s_1, y^s_2, ..., y^s_N\}$ is obtained according to $G(x)$. We feed the sampled word sequence $y^s$ to the reconstructor R, which outputs another sequence of word distributions $\hat{x}$. The reconstructor R reconstructs the original text $x$ from $y^s$. That is, we seek a reconstructor output $\hat{x}$ that is as close to the original text $x$ as possible; hence the loss for training the reconstructor, $R_{loss}$, is defined as
$$ R_{loss} = \sum_{k=1}^{K} l_s(x^{(k)}, \hat{x}^{(k)}), \tag{1} $$
where the reconstruction loss $l_s(x, \hat{x})$ is the cross-entropy loss computed between the reconstructor output sequence $\hat{x}$ and the source text $x$, or equivalently the negative conditional log-likelihood of the source text $x$ given the word sequence $y^s$ sampled from $G(x)$. The reconstructor output sequence $\hat{x}$ is teacher-forced by the source text $x$. The subscript $s$ in $l_s(x, \hat{x})$ indicates that $\hat{x}$ is reconstructed from $y^s$. $K$ is the number of training documents, and (1) is the summation of the cross-entropy loss over all the training documents $x$. In the proposed model, the generator G and reconstructor R form an auto-encoder. However, the reconstructor R does not directly take the generator output distribution $G(x)$ as input. Instead, the reconstructor takes a sampled discrete sequence $y^s$ as input. Because discrete sequences are non-differentiable, we apply the REINFORCE algorithm, which is described in Section 4.
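The sampling and reconstruction steps described above can be illustrated with a minimal Python sketch. It assumes toy word distributions over a three-word vocabulary; the function names `sample_sequence` and `reconstruction_loss` and the hand-written reconstructor output are hypothetical stand-ins rather than the authors' implementation, in which G and R are pointer-generator networks.

```python
import math
import random


def sample_sequence(distributions, rng):
    """Draw one word id per step from the generator's word distributions G(x)."""
    return [rng.choices(range(len(p)), weights=p, k=1)[0] for p in distributions]


def reconstruction_loss(recon_distributions, source_ids):
    """Cross-entropy of the source text under the reconstructor output x_hat.

    With teacher forcing, step t of x_hat is scored against source word t,
    so the loss is the negative log-likelihood of the source sequence.
    """
    return -sum(math.log(p[w]) for p, w in zip(recon_distributions, source_ids))


# Toy generator output G(x): two steps over a 3-word vocabulary (assumed values).
g_x = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]
y_s = sample_sequence(g_x, random.Random(0))

# Hypothetical reconstructor output x_hat for a 3-word source text x = [0, 2, 1].
x_hat = [[0.5, 0.25, 0.25], [0.25, 0.25, 0.5], [0.25, 0.5, 0.25]]
loss = reconstruction_loss(x_hat, [0, 2, 1])  # equals -3 * log(0.5)
```

Because `y_s` is drawn by sampling, no gradient flows from `loss` back to `g_x`, which is why the text turns to REINFORCE for the generator update.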
In addition to reconstruction, we need the discriminator D to discriminate between the real sequence $y^{real}$ and the generated sequence $y^s$, which regularizes the generated sequence toward the distribution of real summaries. D learns to give $y^{real}$ higher scores while giving $y^s$ lower scores. The loss for training the discriminator D is denoted as $D_{loss}$; this is further described in Section 5.
G learns to minimize the reconstruction loss $R_{loss}$ while maximizing the loss of the discriminator D by generating a summary sequence $y^s$ that D cannot distinguish from real sentences. The loss for the generator, $G_{loss}$, is
$$ G_{loss} = \alpha R_{loss} - D'_{loss}, \tag{2} $$
where $D'_{loss}$ is highly related to $D_{loss}$, but not necessarily the same, and $\alpha$ is a hyper-parameter. After obtaining the optimal generator by minimizing (2), we use it to generate summaries.
Generator G and discriminator D together form a GAN. We use two different adversarial training methods to train D and G; as shown in Fig. 2, these two methods have their own discriminators, 1 and 2. Discriminator 1 takes the generator output layer $G(x)$ as input, whereas discriminator 2 takes the sampled discrete word sequence $y^s$ as input. The two methods are described in Sections 5.1 and 5.2, respectively.

Minimizing Reconstruction Loss
Because discrete sequences are non-differentiable, we use the REINFORCE algorithm. The generator is seen as an agent whose reward given the source text $x$ is $-l_s(x, \hat{x})$. Maximizing the reward is equivalent to minimizing the reconstruction loss $R_{loss}$ in (1). However, the reconstruction loss varies widely from sample to sample, and thus the rewards to the generator are not stable either. Hence we add a baseline to reduce their difference. We apply self-critical sequence training (Rennie et al., 2017); the modified reward $r^R(x, \hat{x})$ from reconstructor R with the baseline for the generator is
$$ r^R(x, \hat{x}) = -l_s(x, \hat{x}) - \left(-l_a(x, \hat{x}) - b\right), $$
where $-l_a(x, \hat{x}) - b$ is the baseline. $l_a(x, \hat{x})$ is the same cross-entropy reconstruction loss as $l_s(x, \hat{x})$, except that $\hat{x}$ is obtained from $y^a$ instead of $y^s$. $y^a$ is a word sequence $\{y^a_1, y^a_2, ..., y^a_n, ..., y^a_N\}$, where $y^a_n$ is selected using the argmax function from the generator output distribution $y_n$. Because in the early training stage the sequence $y^s$ rarely yields a higher reward than the sequence $y^a$, we introduce a second baseline score $b$, which gradually decreases to zero, to encourage exploration. Then, the generator is updated using the REINFORCE algorithm with reward $r^R(x, \hat{x})$ to minimize $R_{loss}$.

Figure 2: Architecture of proposed model. The generator network and reconstructor network are seq2seq hybrid pointer-generator networks, but for simplicity, we omit the pointer and the attention parts.
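The self-critical reward can be sketched in a few lines of Python. This is an illustration under stated assumptions: the baseline is taken to be the reward of the argmax-decoded sequence lowered by the second baseline score b, and the linear decay in `decayed_b` (with its `b0` and `decay_steps` parameters) is a hypothetical schedule, since the text only says that b gradually decreases to zero.

```python
def self_critical_reward(loss_sampled, loss_argmax, b):
    """Reward of the sampled sequence y^s relative to the argmax sequence y^a.

    loss_sampled: reconstruction loss l_s when R reads the sampled sequence.
    loss_argmax:  reconstruction loss l_a when R reads the argmax sequence.
    b:            second baseline score (assumed to lower the baseline).
    """
    baseline = -loss_argmax - b
    return -loss_sampled - baseline


def decayed_b(step, b0=1.0, decay_steps=1000):
    """Hypothetical linear schedule that brings b from b0 down to zero."""
    return max(0.0, b0 * (1.0 - step / decay_steps))


# A sampled sequence that reconstructs slightly worse than the argmax one
# is penalized once b has decayed, but still rewarded early in training.
late_reward = self_critical_reward(2.0, 1.5, decayed_b(1000))
early_reward = self_critical_reward(2.0, 1.5, decayed_b(0))
```

With b at its initial value the sampled sequence receives a positive reward even though its loss is higher, which is one way to read the exploration argument in the text.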

GAN Training
With adversarial training, the generator learns to produce sentences as similar to human-written sentences as possible. Here, we experiment with two methods of language generation with GAN. In Section 5.1 we directly feed the generator output probability distributions to the discriminator and use a Wasserstein GAN (WGAN) with a gradient penalty. In Section 5.2, we explore adversarial REINFORCE, which feeds sampled discrete word sequences to the discriminator and uses the discriminator's evaluation of the sequence quality as a reward signal to the generator.

Method 1: Wasserstein GAN
In the lower left of Fig. 2, the discriminator model of this method is shown as discriminator 1, $D_1$. $D_1$ is a deep CNN with residual blocks, which takes a sequence of word distributions as input and outputs a score. The discriminator loss $D_{loss}$ is
$$ D_{loss} = \frac{1}{K}\sum_{k=1}^{K} D_1(G(x^{(k)})) - \frac{1}{K}\sum_{k=1}^{K} D_1(y^{real(k)}) + \beta_1 \frac{1}{K}\sum_{k=1}^{K} \left(\left\|\nabla_{y^{i(k)}} D_1(y^{i(k)})\right\|_2 - 1\right)^2, $$
where $K$ denotes the number of training examples in a batch, and $k$ denotes the $k$-th example. The last term is the gradient penalty (Gulrajani et al., 2017). We interpolate the generator output layer $G(x)$ and the real sample $y^{real}$, and apply the gradient penalty to the interpolated sequence $y^i$. $\beta_1$ determines the gradient penalty scale. In Equation (2), for WGAN, the generator maximizes $D'_{loss}$:
$$ D'_{loss} = \frac{1}{K}\sum_{k=1}^{K} D_1(G(x^{(k)})). $$
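To make the structure of a WGAN critic loss with gradient penalty concrete, here is a small Python sketch. It swaps the deep residual CNN for a toy critic $\tanh(\langle w, y \rangle)$ whose input gradient has a closed form, so no automatic differentiation is needed; the weights, the toy inputs, and all function names are assumptions for illustration only.

```python
import math
import random


def _pre_activation(w, y):
    """Weighted sum of all entries of a (steps x vocab) sequence y."""
    return sum(wi * yi for w_row, y_row in zip(w, y) for wi, yi in zip(w_row, y_row))


def critic(w, y):
    """Toy critic D_1(y) = tanh(<w, y>); a stand-in for the residual CNN."""
    return math.tanh(_pre_activation(w, y))


def critic_grad_norm(w, y):
    """L2 norm of dD_1/dy, which is (1 - tanh^2(<w, y>)) * w for this toy critic."""
    scale = 1.0 - math.tanh(_pre_activation(w, y)) ** 2
    return abs(scale) * math.sqrt(sum(wi * wi for w_row in w for wi in w_row))


def interpolate(fake, real, eps):
    """Point y^i on the line between generator output and a real one-hot sequence."""
    return [[eps * f + (1.0 - eps) * r for f, r in zip(f_row, r_row)]
            for f_row, r_row in zip(fake, real)]


def wgan_gp_critic_loss(w, fakes, reals, beta, rng):
    """Mean fake score - mean real score + beta * mean gradient penalty."""
    k = len(fakes)
    fake_term = sum(critic(w, y) for y in fakes) / k
    real_term = sum(critic(w, y) for y in reals) / k
    penalty = sum(
        (critic_grad_norm(w, interpolate(f, r, rng.random())) - 1.0) ** 2
        for f, r in zip(fakes, reals)
    ) / k
    return fake_term - real_term + beta * penalty


fakes = [[[0.5, 0.5], [0.5, 0.5]]]   # generator output layer G(x), assumed values
reals = [[[1.0, 0.0], [1.0, 0.0]]]   # one-hot human-written sequence
w = [[1.0, 0.0], [1.0, 0.0]]         # hypothetical critic weights
loss_no_penalty = wgan_gp_critic_loss(w, fakes, reals, 0.0, random.Random(0))
```

The sketch assumes generated and real sequences have the same length so that they can be interpolated element by element; a real implementation would need padding or truncation to guarantee this.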

Method 2: Self-Critic Adversarial REINFORCE
In this section, we describe in detail the proposed adversarial REINFORCE method. The core idea is to use an LSTM discriminator to evaluate the current quality of the generated sequence $\{y^s_1, y^s_2, ..., y^s_i\}$ at each time step $i$. Because the generator learns whether the generated sentence improved or worsened relative to the previous time step, it can easily locate the problematic generation step in a long sequence and fix it.

Discriminator 2
As shown in Fig. 2, discriminator 2, $D_2$, is a unidirectional LSTM network which takes a discrete word sequence as input. At time step $i$, given the input word $y^s_i$, it predicts the current score $s_i$ based on the sequence $\{y_1, y_2, ..., y_i\}$. The score is viewed as the quality of the current sequence. An example of a discriminator regularized by weight clipping is shown in Fig. 3.

Figure 3: When the second "arrested" appears, the sentence becomes ungrammatical, and the discriminator determines that this example comes from the generator. Hence, after this time step, it outputs low scores.
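In order to compute the discriminator loss $D_{loss}$, we sum the scores $\{s_1, s_2, ..., s_N\}$ of the whole sequence $y^s$ and normalize by the sequence length to yield
$$ D_2(y^s) = \frac{1}{N}\sum_{n=1}^{N} s_n, $$
where $N$ denotes the generated sequence length. Then, the loss of the discriminator is
$$ D_{loss} = \frac{1}{K}\sum_{k=1}^{K} D_2(y^{s(k)}) - \frac{1}{K}\sum_{k=1}^{K} D_2(y^{real(k)}) + \beta_2 \frac{1}{K}\sum_{k=1}^{K} \left(\left\|\nabla_{y^{i(k)}} D_2(y^{i(k)})\right\|_2 - 1\right)^2. $$
As in the previous section, the last term is the gradient penalty term. With this loss, the discriminator attempts to determine as quickly as possible whether the current sequence is real or fake: the earlier the time step at which the discriminator makes this determination, the lower its loss.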

Self-Critical Generator
Since we feed a discrete sequence $y^s$ to the discriminator, the gradient from the discriminator cannot directly back-propagate to the generator. Here, we use the policy gradient method. At time step $i$, we use the score $s_{i-1}$ that the discriminator gave at time step $i-1$ as the self-critical baseline. The reward $r^D_i$ evaluates whether the quality of the sequence at time step $i$ is better or worse than at time step $i-1$. The generator reward $r^D_i$ from $D_2$ is
$$ r^D_i = \begin{cases} s_i & \text{if } i = 1, \\ s_i - s_{i-1} & \text{otherwise.} \end{cases} $$
However, some sentences may be judged as bad at an earlier time step but as good at later time steps, and vice versa. Hence we use the discount factor $\gamma$ to calculate the discounted reward $d_i$ at time step $i$ as
$$ d_i = \sum_{j=i}^{N} \gamma^{j-i} r^D_j. $$
To maximize the expected discounted reward $d_i$, the loss of the generator is
$$ -\mathbb{E}_{y^s_i \sim p_G(y^s_i \mid y^s_1, \ldots, y^s_{i-1}, x)}\left[d_i\right]. \tag{5} $$
We use the likelihood ratio trick to approximate the gradient to minimize (5).
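The per-step reward and its discounting can be traced with a short Python sketch. It assumes the reward at a step is the change in discriminator score from the previous step (and the score itself at the first step), and it uses the usual REINFORCE surrogate, the negative reward-weighted log-probability, as a stand-in for the likelihood ratio trick; the scores and log-probabilities below are made-up numbers.

```python
import math


def step_rewards(scores):
    """r_i = s_i at the first step, otherwise s_i - s_{i-1}."""
    return [s if i == 0 else s - scores[i - 1] for i, s in enumerate(scores)]


def discounted_rewards(rewards, gamma):
    """d_i = sum over j >= i of gamma^(j - i) * r_j, computed right to left."""
    discounted = [0.0] * len(rewards)
    running = 0.0
    for i in reversed(range(len(rewards))):
        running = rewards[i] + gamma * running
        discounted[i] = running
    return discounted


def policy_gradient_loss(log_probs, discounted):
    """Surrogate loss whose gradient is the REINFORCE estimate: -sum d_i * log p_i."""
    return -sum(d * lp for d, lp in zip(discounted, log_probs))


scores = [0.5, 0.7, 0.2]                  # hypothetical per-step scores from D_2
rewards = step_rewards(scores)            # about [0.5, 0.2, -0.5]
d = discounted_rewards(rewards, gamma=0.5)
loss = policy_gradient_loss([math.log(0.5)] * 3, d)
```

With `gamma=1.0` the discounted reward at the first step telescopes to the final score, so the discount factor controls how strongly later judgments of the sentence are credited to earlier words.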

Experiment
Our model was evaluated on the English and Chinese Gigaword datasets and the CNN/Daily Mail dataset. In Sections 6.1, 6.2 and 6.4, the experiments were conducted on English Gigaword, while the experiments in Sections 6.3 and 6.6 were conducted on the CNN/Daily Mail dataset and the Chinese Gigaword dataset, respectively. We used ROUGE (Lin, 2004) as our evaluation metric. During testing, when using the generator to generate summaries, we used beam search with a beam size of 5, and we eliminated repetition. We provide the details of the implementation and corpus pre-processing in Appendices A and B, respectively.
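As a reminder of what the ROUGE-N scores measure, the following Python sketch computes clipped n-gram overlap between a candidate and a single reference. It is a simplified illustration: it tokenizes on whitespace and omits the stemming, multi-reference handling, and other options of the official ROUGE toolkit, and the example sentences are invented.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Return (precision, recall, F1) of clipped n-gram overlap."""
    cand = ngram_counts(candidate.split(), n)
    ref = ngram_counts(reference.split(), n)
    overlap = sum((cand & ref).values())  # Counter & keeps the minimum count
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return precision, recall, 2 * precision * recall / (precision + recall)


candidate = "police arrested two suspects"
reference = "police arrested two men on monday"
r1 = rouge_n(candidate, reference, n=1)
r2 = rouge_n(candidate, reference, n=2)
```

A short candidate can score well on precision while losing recall against a longer reference, which is relevant when generated summaries are shorter than the ground truth.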
Before jointly training the whole model, we pre-trained the three major components (generator, discriminator, and reconstructor) separately. First, we pre-trained the generator in an unsupervised manner so that it could roughly grasp the semantic meaning of the source text. The details of the pre-training are in Appendix C. We then pre-trained the discriminator and the reconstructor with the pre-trained generator's output to ensure that these two critic networks provide good feedback to the generator.

Table 1: In part (A), the model was trained with supervision. In row (B-1), we select the article's first eight words as its summary. Part (C) shows the results obtained without paired data. In part (D), we trained our model with few labeled data. In part (E), we pre-trained the generator on CNN/Daily Mail and used the summaries from CNN/Daily Mail as real data for the discriminator.

English Gigaword
The English Gigaword is a sentence summarization dataset which contains the first sentence of each article and its corresponding headline. The preprocessed corpus contains 3.8M training pairs and 400K validation pairs. We trained our model on the 3.8M training set, treating either part or all of it as unpaired data. For a fair comparison with previous work, the following experiments were evaluated on the same 2K test set as (Rush et al., 2015; Miao and Blunsom, 2016). We used the sentences in article headlines as real data for the discriminator. As shown in the following experiments, the headlines can even come from another set of documents not related to the training documents. The results on English Gigaword are shown in Table 1. WGAN and adversarial REINFORCE refer to the adversarial training methods described in Sections 5.1 and 5.2, respectively. Results trained with fully labeled data are in part (A). In row (A-1), we trained our generator with supervised training. Compared with previous work (Zhou et al., 2017), we used a simpler model and a smaller vocabulary size. We did not try to achieve state-of-the-art results because the focus of this work is unsupervised learning, and the proposed approach is independent of the summarization model used. In row (B-1), we simply took the first eight words in a document as its summary.
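The trivial baseline of row (B-1) is simple enough to state as code. The sketch below assumes whitespace tokenization, which is an illustration choice and not necessarily the tokenization used in the experiments; the function name `lead_k_words` is hypothetical.

```python
def lead_k_words(document, k=8):
    """Return the first k whitespace-separated words of a document."""
    return " ".join(document.split()[:k])


article = "the quick brown fox jumps over the lazy dog near the river bank"
baseline_summary = lead_k_words(article)
```

Documents shorter than `k` words are returned unchanged, so the baseline never pads its output.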
The results for the generator pre-trained with the method described in Appendix C are shown in row (C-1). In part (C), we directly took the sentences in the summaries of Gigaword as the training data of the discriminator. Compared with the pre-trained generator and the trivial baseline, the proposed approach (rows (C-2) and (C-3)) showed a clear improvement. In Fig. 4, we provide a real example. More examples can be found in Appendix D.

Semi-Supervised Learning
In semi-supervised training, the generator was pre-trained with the few available labeled data. During training, we conducted teacher-forcing with labeled data on the generator after several updates without labeled data. With 10K, 500K and 1M labeled data, teacher-forcing was conducted every 25, 5 and 3 updates without paired data, respectively. In teacher-forcing, given the source text as input, the generator was teacher-forced to predict the human-written summary of the source text. Teacher-forcing can be regarded as a regularization of unpaired training that prevents the generator from producing unreasonable summaries of the source text. We found that if we teacher-forced the generator too frequently, it would overfit the training data, since only a small amount of labeled data was used in semi-supervised training.
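The interleaving of unpaired updates and teacher-forced updates can be sketched as a simple schedule in Python. The sketch reads "every 25, 5 and 3 updates without paired data" as one teacher-forced update after each such block of unpaired updates, which is an interpretation; the names `TEACHER_FORCING_INTERVAL` and `update_schedule` are hypothetical.

```python
# Number of unpaired updates between two teacher-forced updates,
# keyed by the amount of labeled data (values taken from the text).
TEACHER_FORCING_INTERVAL = {10_000: 25, 500_000: 5, 1_000_000: 3}


def update_schedule(num_labeled, num_unpaired_updates):
    """Yield the kind of each update in training order."""
    interval = TEACHER_FORCING_INTERVAL[num_labeled]
    for step in range(1, num_unpaired_updates + 1):
        yield "unpaired"
        if step % interval == 0:
            yield "teacher_forcing"


schedule_1m = list(update_schedule(1_000_000, 6))
```

Less labeled data maps to a longer interval, which matches the observation that frequent teacher-forcing on a small labeled set leads to overfitting.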
The performance of the semi-supervised model on English Gigaword for different amounts of available labeled data is shown in Table 1 part (D). We compared our results with (Miao and Blunsom, 2016), which was the previous state-of-the-art method on the semi-supervised summarization task, under the same amount of labeled data. With both 500K and 1M labeled data, our method performed better. Furthermore, with only 1M labeled data, using adversarial REINFORCE even outperformed supervised training in Table 1 (A-1) with the whole 3.8M labeled data.

Figure 4: Real examples from the methods in Table 1. The proposed methods generated summaries that grasped the core idea of the articles.

CNN/Daily Mail dataset
The CNN/Daily Mail dataset is a long text summarization dataset which is composed of news articles paired with summaries. We evaluated our model on this dataset because it is a popular benchmark dataset, and we wanted to know whether the proposed model works on long input and long output sequences. The details of corpus pre-processing can be found in Appendix B. In unpaired training, to prevent the model from directly matching the input articles to their corresponding summaries, we split the training pairs into two equal sets: one set supplied only articles and the other set supplied only summaries.
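The split described above can be sketched in Python as follows. This is a minimal illustration under the assumption that pairs are shuffled before being halved; the function name `split_unpaired`, the seed, and the toy data are hypothetical.

```python
import random


def split_unpaired(pairs, seed=0):
    """Split (article, summary) pairs into articles-only and summaries-only halves.

    No article in the first set has its own summary in the second set,
    so the model cannot match an input to its paired target.
    """
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    articles = [article for article, _ in shuffled[:half]]
    summaries = [summary for _, summary in shuffled[half:]]
    return articles, summaries


pairs = [(f"article-{i}", f"summary-{i}") for i in range(10)]
articles, summaries = split_unpaired(pairs)
```

The generator and reconstructor would then read only `articles`, while the discriminator would take `summaries` as its real data.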
The results are shown in Table 2. For the supervised approaches in part (A), although our seq2seq model was similar to (See et al., 2017), there was a performance gap between our model and the scores reported in (See et al., 2017) because of our smaller vocabulary size (we did not handle out-of-vocabulary words), simpler model architecture, and shorter generated summaries. Compared to the lead-3 baseline in part (B), which took the first three sentences of articles as summaries, the seq2seq models fell behind. That was because news writers often put the most important information in the first few sentences, and thus even the best abstractive summarization model only slightly beat the lead-3 baseline on ROUGE scores. However, during pre-training and training we did not assume that the most important sentences are in the first few sentences.
We observed that our unpaired model yielded a decent ROUGE-1 score, but lower ROUGE-2 and ROUGE-L scores. That was probably because our generated sequences were shorter than the ground truth and our vocabulary size was small. Another reason was that the generator was good at selecting the most important words from the articles, but sometimes failed to combine them into reasonable sentences, because it is still difficult for a GAN to generate long sequences. In addition, since the reconstructor only evaluated the reconstruction loss of the whole sequence, as the generated sequence became long, the reconstruction reward for the generator became extremely sparse. However, compared to the pre-trained generator (rows (C-2), (C-3) vs. (C-1)), our model still improved the ROUGE scores. A real example of a generated summary can be found in Appendix D, Fig. 11.

Transfer Learning
The experiments conducted up to this point required headlines that were unpaired with the documents but came from the same domain to train the discriminator. In this subsection, we generated the summaries from English Gigaword (target domain), but the summaries for the discriminator were from the CNN/Daily Mail dataset (source domain).
The results of transfer learning are shown in Table 1 part (E). Row (E-1) is the result of the pre-trained generator; the poor pre-training result indicates that the data distributions of the two datasets are quite different. We find that using sentences from another dataset yields lower ROUGE scores on the target test set (part (E) vs. part (C)) due to the mismatch between the word distributions of the summaries in the source and target domains. However, the discriminator still regularizes the generated word sequence. After unpaired training, the model improved on the ROUGE scores of the pre-trained model (rows (E-2), (E-3) vs. (E-1)), and it also surpassed the trivial baselines in part (B).

GAN Training
In this section, we discuss the performance of the two GAN training methods. As shown in Table 1, on English Gigaword, our proposed adversarial REINFORCE method performed better than WGAN. However, in Table 2, our proposed method was slightly outperformed by WGAN. In addition, we find that convergence is faster when training with WGAN. Because WGAN directly evaluates the distance between the continuous distribution from the generator and the discrete distribution from real data, the generator distribution was sharpened at an early stage of training. This caused the generator to converge to a relatively poor solution. On the other hand, when training with REINFORCE, the generator keeps seeking network parameters that can better fool the discriminator. We believe that training GANs for language generation with this method is worth exploring.

Chinese Gigaword
The Chinese Gigaword is a long text summarization dataset composed of paired headlines and news. Unlike the input news in English Gigaword, the news in Chinese Gigaword consists of several sentences. The results are shown in Table 3. Row (A) lists the results using 1.1M document-summary pairs to directly train the generator without the reconstructor and discriminator: this is the upper bound of the proposed approach. In row (B), we simply took the first fifteen words in a document as its summary. The number of words was chosen to optimize the evaluation metrics. Part (C) shows the results obtained in the scenario without paired data. The discriminator took the summaries in the training set as real data. We show the results of the pre-trained generator in row (C-1); rows (C-2) and (C-3) are the results for the two GAN training methods, respectively. We find that despite the performance gap between the unpaired and supervised methods (rows (C-2), (C-3) vs. (A)), the proposed method yielded much better performance than the trivial baseline (rows (C-2), (C-3) vs. (B)).

Conclusion and Future Work
Using GAN, we propose a model that encodes text as a human-readable summary, learned without document-summary pairs. In future work, we hope to use extra discriminators to control the style and sentiment of the generated summaries.