Autoencoder as Assistant Supervisor: Improving Text Representation for Chinese Social Media Text Summarization

Most current abstractive text summarization models are based on the sequence-to-sequence model (Seq2Seq). The source content of social media is long and noisy, so it is difficult for Seq2Seq to learn an accurate semantic representation. Compared with the source content, the annotated summary is short and well written. Moreover, it shares the same meaning as the source content. In this work, we supervise the learning of the representation of the source content with that of the summary. In our implementation, we regard a summary autoencoder as an assistant supervisor of Seq2Seq. Following previous work, we evaluate our model on a popular Chinese social media dataset. Experimental results show that our model achieves state-of-the-art performance on the benchmark dataset.


Introduction
Text summarization aims to produce a brief summary of the main ideas of a text. Unlike extractive text summarization (Radev et al., 2004; Woodsend and Lapata, 2010; Cheng and Lapata, 2016), which selects words or phrases from the source texts as the summary, abstractive text summarization learns a semantic representation to generate more human-like summaries. Recently, most models for abstractive text summarization are based on the sequence-to-sequence model, which encodes the source texts into a semantic representation with an encoder, and generates the summaries from the representation with a decoder.
The contents on social media are long and contain many errors, which come from spelling mistakes, informal expressions, and grammatical mistakes (Baldwin et al., 2013). The large number of errors in the contents causes great difficulties for text summarization. For RNN-based Seq2Seq, it is difficult to compress a long sequence into an accurate representation (Li et al., 2015) because of the vanishing and exploding gradient problem.
Compared with the source content, it is easier to encode the representations of the summaries, which are short and manually selected. Since the source content and the summary share the same points, it is possible to supervise the learning of the semantic representation of the source content with that of the summary.
In this paper, we regard a summary autoencoder as an assistant supervisor of Seq2Seq. First, we train an autoencoder, which inputs and reconstructs the summaries, to obtain a better representation for generating the summaries. Then, we supervise the internal representation of Seq2Seq with that of the autoencoder by minimizing the distance between the two representations. Finally, we use adversarial learning to enhance the supervision. Following previous work (Ma et al., 2017), we evaluate our proposed model on a Chinese social media dataset. Experimental results show that our model outperforms the state-of-the-art baseline models. More specifically, our model outperforms the Seq2Seq baseline by 7.1 ROUGE-1, 6.1 ROUGE-2, and 7.0 ROUGE-L.

Proposed Model
We introduce our proposed model in detail in this section.

Notation
Given a summarization dataset that consists of $N$ data samples, the $i$-th data sample $(x_i, y_i)$ contains a source content $x_i = \{x_1, x_2, ..., x_M\}$ and a summary $y_i = \{y_1, y_2, ..., y_L\}$, where $M$ is the number of source words and $L$ is the number of summary words. At the training stage, we train the model to generate the summary $y$ given the source content $x$. At the test stage, the model decodes the predicted summary $y$ given the source content $x$.

Supervision with Autoencoder
Figure 1 shows the architecture of our model. At the training stage, the source content encoder compresses the input content $x$ into the internal representation $z_t$ with a Bi-LSTM encoder. At the same time, the summary encoder compresses the reference summary $y$ into the representation $z_s$. Then both $z_t$ and $z_s$ are fed into an LSTM decoder to generate the summary. Finally, the semantic representation of the source content is supervised by that of the summary. We implement the supervision by minimizing the distance between the semantic representations $z_t$ and $z_s$, and this term in the loss function can be written as:

$$L_S = \frac{\lambda}{N_h} \, d(z_t, z_s) \qquad (1)$$

where $d(z_t, z_s)$ is a function which measures the distance between $z_s$ and $z_t$, $\lambda$ is a tunable hyper-parameter that balances the supervision loss against the other parts of the loss, and $N_h$ is the number of hidden units, which limits the magnitude of the distance function. We set $\lambda = 0.3$ based on the performance on the validation set. The distance between the two representations can be written as:

$$d(z_t, z_s) = \lVert z_t - z_s \rVert_2 \qquad (2)$$
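As a rough illustration of the supervision term, the sketch below computes a distance between two representation vectors and scales it by λ and the number of hidden units. It assumes the distance is the unsquared Euclidean (L2) norm and that representations are plain Python lists; the function name `supervision_loss` is hypothetical and not taken from the authors' code.

```python
import math


def supervision_loss(z_t, z_s, lam=0.3):
    """Scaled L2 distance between the source and summary representations.

    Assumption: d(z_t, z_s) is the unsquared Euclidean norm of z_t - z_s.
    """
    if len(z_t) != len(z_s):
        raise ValueError("representations must have the same size")
    n_h = len(z_t)  # number of hidden units, limits the magnitude of the loss
    distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(z_t, z_s)))
    return lam / n_h * distance


# Toy 4-dimensional representations (illustrative values only).
z_t = [0.0, 0.0, 0.0, 0.0]
z_s = [3.0, 4.0, 0.0, 0.0]
print(supervision_loss(z_t, z_s))
```

In a real model the two vectors would be the final encoder states of the Seq2Seq encoder and the summary autoencoder, and the loss would be computed with a differentiable tensor library rather than plain lists.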

Adversarial Learning
We further enhance the supervision with an adversarial learning approach. As shown in Eq. 1, we use a fixed hyper-parameter as a weight to control the strength of the supervision of the autoencoder. However, when the source content and the summary have high relevance, the strength of the supervision should be higher, and when they have low relevance, the strength should be lower. In order to determine the strength of supervision more dynamically, we introduce adversarial learning. More specifically, we regard the representation of the autoencoder as the "gold" representation, and that of the sequence-to-sequence model as the "fake" representation. A model, called the discriminator, is trained to distinguish between the gold and fake representations. In contrast, the supervision, which minimizes the distance between the representations and makes them similar, tries to prevent the discriminator from making correct predictions. In this way, when the discriminator can distinguish the two representations (which means the source content and the summary have low relevance), the strength of supervision is decreased, and when the discriminator fails to distinguish them, the strength of supervision is increased.
In the implementation of adversarial learning, the discriminator objective function can be written as:

$$L_D(\theta_D) = -\log P_{\theta_D}(y = 1 \mid z_s) - \log P_{\theta_D}(y = 0 \mid z_t) \qquad (3)$$

where $P_{\theta_D}(y = 1 \mid z)$ is the probability that the discriminator identifies the vector $z$ as the "gold" representation, $P_{\theta_D}(y = 0 \mid z)$ is the probability that the vector $z$ is identified as the "fake" representation, and $\theta_D$ denotes the parameters of the discriminator. When minimizing the discriminator objective, we only train the parameters of the discriminator, while the rest of the parameters remain unchanged.
The supervision objective, which works against the discriminator, can be written as:

$$L_G(\theta_E) = -\log P_{\theta_D}(y = 0 \mid z_s) - \log P_{\theta_D}(y = 1 \mid z_t) \qquad (4)$$

where $\theta_E$ denotes the parameters of the encoders. When minimizing the supervision objective, we only update the parameters of the encoders.
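The two opposing objectives described above can be sketched with a toy discriminator. This is a minimal illustration that assumes the standard binary cross-entropy form with flipped labels for the encoder side; the linear discriminator and the names `prob_gold`, `discriminator_loss`, and `encoder_adversarial_loss` are hypothetical stand-ins, not the authors' implementation.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def prob_gold(z, weights, bias):
    """P(y = 1 | z): probability that z is the 'gold' (autoencoder) vector.

    Assumption: a linear discriminator followed by a sigmoid.
    """
    return sigmoid(sum(w * v for w, v in zip(weights, z)) + bias)


def discriminator_loss(z_s, z_t, weights, bias):
    # The discriminator wants z_s labelled gold (y=1) and z_t labelled fake (y=0).
    return (-math.log(prob_gold(z_s, weights, bias))
            - math.log(1.0 - prob_gold(z_t, weights, bias)))


def encoder_adversarial_loss(z_s, z_t, weights, bias):
    # The encoders want the opposite outcome, so the labels are flipped.
    return (-math.log(1.0 - prob_gold(z_s, weights, bias))
            - math.log(prob_gold(z_t, weights, bias)))


# Toy vectors that the discriminator separates well (illustrative values only).
weights, bias = [1.0, 0.0], 0.0
z_s, z_t = [2.0, 0.0], [-2.0, 0.0]
print(discriminator_loss(z_s, z_t, weights, bias))
print(encoder_adversarial_loss(z_s, z_t, weights, bias))
```

In training, the first loss would update only the discriminator parameters and the second only the encoder parameters, alternating between the two.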

Loss Function and Training
There are several parts of the objective function to optimize in our model. The first part consists of the cross-entropy losses of the sequence-to-sequence model and the autoencoder. The second part is the L2 loss of the supervision, as written in Equation 1. The last part is the adversarial learning, given by Equation 3 and Equation 4. The sum of all these parts is the final loss function to optimize. We use the Adam (Kingma and Ba, 2014) optimization method to train the model. For the hyper-parameters of the Adam optimizer, we set the learning rate $\alpha = 0.001$, the two momentum parameters $\beta_1 = 0.9$ and $\beta_2 = 0.999$, and $\epsilon = 1 \times 10^{-8}$. We clip the gradients (Pascanu et al., 2013) to a maximum norm of 10.0.
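The optimization settings above can be illustrated with a small, dependency-free sketch of gradient clipping followed by one Adam update. The hyper-parameter values come from the text; the flat list-of-floats parameter layout, the global-norm interpretation of the clipping, and the names `clip_by_global_norm` and `Adam` are assumptions made for illustration.

```python
import math


def clip_by_global_norm(grads, max_norm=10.0):
    """Rescale the gradient vector so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


class Adam:
    """Textbook Adam update over a flat list of parameters."""

    def __init__(self, n_params, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m = [0.0] * n_params  # first-moment estimates
        self.v = [0.0] * n_params  # second-moment estimates
        self.t = 0                 # step counter for bias correction

    def step(self, params, grads):
        self.t += 1
        updated = []
        for i, (p, g) in enumerate(zip(params, grads)):
            self.m[i] = self.beta1 * self.m[i] + (1 - self.beta1) * g
            self.v[i] = self.beta2 * self.v[i] + (1 - self.beta2) * g * g
            m_hat = self.m[i] / (1 - self.beta1 ** self.t)
            v_hat = self.v[i] / (1 - self.beta2 ** self.t)
            updated.append(p - self.lr * m_hat / (math.sqrt(v_hat) + self.eps))
        return updated


# One update on toy parameters (illustrative values only).
params = [1.0, -2.0]
grads = clip_by_global_norm([30.0, 40.0])  # norm 50 is rescaled to norm 10
optimizer = Adam(n_params=len(params))
params = optimizer.step(params, grads)
print(grads, params)
```

Deep learning frameworks provide both operations, so a sketch like this is only useful for seeing how the stated hyper-parameters enter the update.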

Experiments
Following the previous work (Ma et al., 2017), we evaluate our model on a popular Chinese social media dataset. We first introduce the datasets, evaluation metrics, and experimental details. Then, we compare our model with several state-of-the-art systems.

Dataset
The Large Scale Chinese Social Media Text Summarization Dataset (LCSTS) is constructed by Hu et al. (2015). The dataset consists of more than 2,400,000 text-summary pairs, constructed from a famous Chinese social media website called Sina Weibo (http://weibo.com). It is split into three parts, with 2,400,591 pairs in PART I, 10,666 pairs in PART II and 1,106 pairs in PART III. All the text-summary pairs in PART II and PART III are manually annotated with relevance scores ranging from 1 to 5. We only keep pairs with scores no less than 3, leaving 8,685 pairs in PART II and 725 pairs in PART III. Following previous work (Hu et al., 2015), we use PART I as the training set, PART II as the validation set, and PART III as the test set.
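The relevance filtering applied to PART II and PART III can be sketched as follows. The tuple layout `(text, summary, score)` and the function name `filter_by_relevance` are hypothetical; the actual LCSTS files use their own format, which is not reproduced here.

```python
def filter_by_relevance(pairs, min_score=3):
    """Keep text-summary pairs whose human relevance score is at least min_score."""
    return [(text, summary) for text, summary, score in pairs if score >= min_score]


# Toy annotated pairs (illustrative placeholders, not real LCSTS entries).
annotated = [
    ("text A", "summary A", 5),
    ("text B", "summary B", 2),
    ("text C", "summary C", 3),
]
print(filter_by_relevance(annotated))
```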

Evaluation Metric
Our evaluation metric is ROUGE score (Lin and Hovy, 2003), which is popular for summarization evaluation. The metrics compare an automatically produced summary with the reference summaries, by computing overlapping lexical units, including unigram, bigram, trigram, and longest common subsequence (LCS). Following previous work (Rush et al., 2015;Hu et al., 2015), we use ROUGE-1 (unigram), ROUGE-2 (bi-gram) and ROUGE-L (LCS) as the evaluation metrics in the reported experimental results.
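To make the overlap computation concrete, the following is a simplified sketch of ROUGE-N and ROUGE-L F1 for a single candidate and a single reference. It operates on any sequence of tokens, so a Chinese string is treated as a sequence of characters. It is not the official ROUGE toolkit: it assumes a balanced F1 with no stemming or multi-reference handling, and the function names are hypothetical.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Multiset of n-grams in a token sequence (a string counts as characters)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1(overlap, cand_total, ref_total):
    if overlap == 0 or cand_total == 0 or ref_total == 0:
        return 0.0
    precision = overlap / cand_total
    recall = overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(candidate, reference, n):
    cand, ref = ngram_counts(candidate, n), ngram_counts(reference, n)
    overlap = sum((cand & ref).values())  # clipped n-gram matches
    return f1(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    return f1(lcs_length(candidate, reference), len(candidate), len(reference))


# Character-level scoring of a generated summary against a reference.
reference = "航班多人吸烟机组人员与乘客冲突"
candidate = "成都飞北京航班多人吸烟机组人员与乘客冲突"
print(rouge_n(candidate, reference, 1),
      rouge_n(candidate, reference, 2),
      rouge_l(candidate, reference))
```

Scores from this sketch will generally differ slightly from those of the official toolkit, so it should be used only to understand the metric, not to reproduce reported numbers.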

Experimental Details
The vocabularies are extracted from the training sets, and the source contents and the summaries share the same vocabularies. In order to alleviate the risk of word segmentation mistakes, we split the Chinese sentences into characters. We prune the vocabulary size to 4,000, which covers most of the common characters.
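The character-level preprocessing can be sketched as below: sentences are split into characters, the most frequent characters form the vocabulary, and everything else maps to an unknown token. The special tokens `<pad>` and `<unk>`, the choice to count them toward the size limit, and the function names are assumptions for illustration.

```python
from collections import Counter

PAD, UNK = "<pad>", "<unk>"  # hypothetical special tokens


def build_char_vocab(texts, max_size=4000):
    """Map the most frequent characters to integer ids, capped at max_size entries."""
    counts = Counter(ch for text in texts for ch in text if not ch.isspace())
    keep = max_size - 2  # reserve two slots for the special tokens
    vocab = [PAD, UNK] + [ch for ch, _ in counts.most_common(keep)]
    return {ch: idx for idx, ch in enumerate(vocab)}


def encode(text, vocab):
    """Split a sentence into characters and look up their ids."""
    return [vocab.get(ch, vocab[UNK]) for ch in text if not ch.isspace()]


# Shared vocabulary for source contents and summaries (toy data).
corpus = ["航班多人吸烟", "机组人员与乘客冲突"]
vocab = build_char_vocab(corpus)
print(len(vocab), encode("乘客吸烟", vocab))
```

Working on characters avoids dependence on a Chinese word segmenter, at the cost of longer input sequences.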
We tune the hyper-parameters based on the ROUGE scores on the validation set. We set the word embedding size and the hidden size to 512, and the number of LSTM layers to 2. The batch size is 64, and we do not use dropout (Srivastava et al., 2014) on this dataset. Following previous work, we implement beam search and set the beam size to 10.
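A generic beam search decoder can be sketched as follows. The sketch assumes a hypothetical `step_fn` that maps a partial sequence to next-token log-probabilities, and it ranks hypotheses by summed log-probability with no length normalization; the text does not specify these details, and the toy distribution below is invented purely to show a case where beam search beats greedy decoding.

```python
import math


def beam_search(step_fn, bos, eos, beam_size=10, max_len=20):
    """Return the highest-scoring sequence found by beam search."""
    beams = [(0.0, [bos])]  # (summed log-probability, tokens so far)
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            for token, logp in step_fn(seq).items():
                candidates.append((score + logp, seq + [token]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, seq in candidates[:beam_size]:
            # Hypotheses that emit the end token leave the beam.
            (finished if seq[-1] == eos else beams).append((score, seq))
        if not beams:
            break
    finished.extend(beams)  # fall back to unfinished hypotheses at max_len
    return max(finished, key=lambda c: c[0])[1]


# Toy next-token distribution (hypothetical): "a" looks best at first,
# but the path through "b" has the higher total probability.
TOY = {
    ("<s>",): {"a": math.log(0.6), "b": math.log(0.4)},
    ("<s>", "a"): {"x": math.log(0.5), "y": math.log(0.5)},
    ("<s>", "b"): {"z": math.log(0.9), "w": math.log(0.1)},
}


def toy_step(seq):
    return TOY.get(tuple(seq), {"</s>": 0.0})


print(beam_search(toy_step, "<s>", "</s>", beam_size=10))
print(beam_search(toy_step, "<s>", "</s>", beam_size=1))
```

With a beam size of 1 the procedure reduces to greedy decoding, which commits to the locally best first token and misses the better overall sequence.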

Baselines
We compare our model with the following state-of-the-art baselines.
• RNN and RNN-context are two sequence-to-sequence baselines with GRU encoder and decoder, provided by Hu et al. (2015). The difference between them is that RNN-context has an attention mechanism while RNN does not.
• CopyNet (Gu et al., 2016) incorporates a copy mechanism to allow parts of the generated summary to be copied from the source content.
• SRB (Ma et al., 2017) is a sequence-to-sequence based neural model that improves the semantic relevance between the input text and the output summary.
• DRGD is a deep recurrent generative decoder model, combining the decoder with a variational autoencoder.
• Seq2seq is our implementation of the sequence-to-sequence model with the attention mechanism, which has the same experimental setting as our model for fair comparison.

Results
For simplicity, we denote our supervision with autoencoder model as superAE. We report the ROUGE F1 score of our model and the baseline models on the test sets.
The superAE model has a large improvement over the Seq2Seq baseline of 7.1 ROUGE-1, 6.1 ROUGE-2, and 7.0 ROUGE-L, which demonstrates the effectiveness of our model. Moreover, we compare our model with recent summarization systems, which have been evaluated on the same training set and test set as ours. Their results are directly reported in the referred articles. The comparison shows that our superAE outperforms all of these models, with a gain of 2.2 ROUGE-1, 1.8 ROUGE-2, and 2.0 ROUGE-L. We also perform an ablation study by removing the adversarial learning component, in order to show its contribution. It shows that adversarial learning improves the performance by 1.5 ROUGE-1, 0.7 ROUGE-2, and 1.0 ROUGE-L. We also give a summarization example of our model. As shown in Table 3, the Seq2Seq model captures the wrong meaning of the source content, and produces the summary "China United Airlines exploded in the airport". Our superAE model captures the correct points, so that the generated summary is close in meaning to the reference summary.

Table 2: Accuracy of the sentiment classification on the Amazon dataset. We train a classifier which inputs the internal representation provided by the sequence-to-sequence model, and outputs a predicted label. We compute the 2-class and 5-class accuracy of the predicted labels to evaluate the quality of the text representation.

Analysis of text representation
We want to analyze whether the internal text representation is improved by our superAE model. Since the text representation is abstract and hard to evaluate directly, we translate the representation into a sentiment score with a sentiment classifier, and evaluate the quality of the representation by means of the sentiment accuracy.
We perform experiments on the Amazon Fine Foods Reviews Corpus (McAuley and Leskovec, 2013). The Amazon dataset contains users' rating labels as well as summaries for the reviews, making it possible to train a classifier to predict the sentiment labels and a seq2seq model to generate summaries. First, we train the superAE model and the seq2seq model with the text-summary pairs until convergence. Then, we transfer the encoders to a sentiment classifier, and train the classifier while keeping the parameters of the encoders fixed. The classifier is a simple feedforward neural network which maps the representation into the label distribution. Finally, we compute the accuracy of the predicted 2-class labels and 5-class labels.

Source: 昨晚，中联航空成都飞北京一架航班被发现有多人吸烟。后因天气原因，飞机备降太原机场。有乘客要求重新安检，机长决定继续飞行，引起机组人员与未吸烟乘客冲突。
Last night, several people were caught smoking on a flight of China United Airlines from Chengdu to Beijing. Later the flight temporarily landed at Taiyuan Airport. Some passengers asked for a security check but were denied by the captain, which led to a conflict between crew and passengers.
Reference: 航班多人吸烟机组人员与乘客冲突。
Several people smoked on a flight, which led to a conflict between crew and passengers.
Seq2Seq: 中联航空机场发生爆炸致多人死亡。
China United Airlines exploded in the airport, leaving several people dead.
+superAE: 成都飞北京航班多人吸烟机组人员与乘客冲突。
Several people smoked on a flight from Chengdu to Beijing, which led to a conflict between crew and passengers.
Table 3: A summarization example of our model, compared with Seq2Seq and the reference.
As shown in Table 2, the seq2seq model achieves 80.7% and 65.1% accuracy on the 2-class and 5-class tasks, respectively. Our superAE model outperforms the baseline by a large margin of 8.1% and 6.6%.

Related Work
Rush et al. (2015) first propose an abstractive summarization model, which uses an attentive CNN encoder to compress texts and a neural network language model to generate summaries. Chopra et al. (2016) explore a recurrent structure for abstractive summarization. To deal with the out-of-vocabulary problem, Nallapati et al. (2016) propose a generator-pointer model so that the decoder is able to generate words from the source texts. Gu et al. (2016) also address this issue by incorporating a copying mechanism, allowing parts of the summaries to be copied from the source contents. See et al. (2017) further discuss this problem, and combine the pointer-generator model with a coverage mechanism. Hu et al. (2015) build a large corpus of Chinese social media short text summarization, which is one of our benchmark datasets. Chen et al. (2016) introduce a distraction based neural model, which forces the attention mechanism to focus on different parts of the source inputs. Ma et al. (2017) propose a neural model to improve the semantic relevance between the source contents and the summaries.
Our work is also related to the sequence-to-sequence model, and the autoencoder model (Bengio, 2009; Liou et al., 2008, 2014). The sequence-to-sequence model is one of the most successful generative neural models, and is widely applied in machine translation (Jean et al., 2015), text summarization (Rush et al., 2015; Chopra et al., 2016; Nallapati et al., 2016), and other natural language processing tasks. The autoencoder (Bengio, 2009) is an artificial neural network used for unsupervised learning of efficient representations. Neural attention model is first proposed by .

Conclusion
We propose a novel model, in which the autoencoder is a supervisor of the sequence-to-sequence model, to learn a better internal representation for abstractive summarization. An adversarial learning approach is introduced to further improve the supervision of the autoencoder. Experimental results show that our model outperforms the sequence-to-sequence baseline by a large margin, and achieves state-of-the-art performance on a Chinese social media dataset.