Low-Resource Neural Headline Generation

Recent neural headline generation models have shown great results, but are generally trained on very large datasets. We focus our efforts on improving headline quality on smaller datasets by means of pre-training. We propose new methods that enable pre-training all the parameters of the model and utilizing all available text, resulting in improvements of up to 32.4% relative in perplexity and 2.84 points in ROUGE.


Introduction
Neural headline generation (NHG) is the process of automatically generating a headline based on the text of the document using artificial neural networks.
Headline generation is a subtask of text summarization. While a summary may cover multiple documents, generally uses a style similar to that of the summarized document, and consists of multiple sentences, a headline, in contrast, covers a single document, is often written in a different style (Headlinese (Mårdh, 1980)), and is much shorter (frequently limited to a single sentence).
Due to its shortness and specific style, condensing the document into a headline often requires the ability to paraphrase, which makes this task a good fit for abstractive summarization approaches, where neural network based attentive encoder-decoder (Bahdanau et al., 2015) models have recently shown impressive results (e.g., Rush et al. (2015)).
While state-of-the-art results have been obtained by training NHG models on large datasets like Gigaword, access to such resources is often not possible, especially when it comes to low-resource languages. In this work we focus on maximizing performance on smaller datasets with different pre-training methods.
One of the reasons to expect pre-training to be an effective way to improve performance on small datasets is that NHG models are generally trained to generate headlines based on just the first few sentences of the documents (Rush et al., 2015; Shen et al., 2016; Chopra et al., 2016). This leaves the rest of the text unutilized, which can be alleviated by pre-training subsets of the model on full documents. Additionally, the decoder component of NHG models can be regarded as a language model (LM) whose predictions are biased by the external information from the encoder. As a LM it sees only headlines during training, which is a small fraction of text compared to the documents. Supplementing the training data of the decoder with documents via pre-training might enable it to learn more about words and language structure.
Although some previous work has used pre-training (Alifimoff, 2015), it has not been fully explored how much pre-training helps and what the optimal way to do it is. Another problem is that in previous work only a subset of parameters (usually just the embeddings) is pre-trained, leaving the rest of the parameters randomly initialized.
The main contributions of this paper are: LM pre-training for fully initializing the encoder and decoder (sections 2.1 and 2.2); combining LM pre-training with distant supervision (Mintz et al., 2009) pre-training using filtered sentences of the documents as noisy targets (i.e. predicting one sentence given the rest) to maximally utilize the entire available dataset and pre-train all the parameters of the NHG model (section 2.3); and analysis of the effect of pre-training different components of the NHG model (section 3.3).

Figure 1: A high level description of the NHG model (components: Encoder, Attention, Init., Decoder). The model predicts the next headline word $y_t$ given the words in the document $x_1 \dots x_N$ and already generated headline words $y_1 \dots y_{t-1}$.

Method
The model that we use follows the architecture described by Bahdanau et al. (2015). Although originally created for neural machine translation, this architecture has been successfully used for NHG (e.g., by Shen et al. (2016) and, in a simplified form, by Chopra et al. (2016)).
The NHG model consists of: a bidirectional (Schuster and Paliwal, 1997) encoder with gated recurrent units (GRU) (Cho et al., 2014); a unidirectional GRU decoder; and an attention mechanism and a decoder initialization layer that connect the encoder and decoder (Bahdanau et al., 2015).
During headline generation, the encoder reads and encodes the words of the document. Initialized by the encoder, the decoder then starts generating the headline one word at a time, attending to relevant parts in the document using the attention mechanism (Figure 1). During training the parameters are optimized to maximize the probabilities of reference headlines.
While at the start of training generally either the parameters of all the components are randomly initialized or only pre-trained embeddings (with dashed outline in Figure 1) are used (Paulus et al., 2017), we propose pre-training methods for more extensive initialization.
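The attention step mentioned above can be illustrated with a minimal sketch of additive attention in the style of Bahdanau et al. (2015). This is not the implementation used in our experiments: the function name `additive_attention`, the parameter names `W_dec`, `W_enc` and `v`, and the plain-list representation of vectors are assumptions made for illustration only.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(matrix, vector):
    return [dot(row, vector) for row in matrix]

def softmax(scores):
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def additive_attention(dec_state, enc_states, W_dec, W_enc, v):
    """One decoder step: score every encoder state against the decoder state.

    Returns (weights, context), where weights sum to 1 and context is the
    weighted average of the encoder states.
    """
    proj_dec = matvec(W_dec, dec_state)
    scores = []
    for h in enc_states:
        proj_enc = matvec(W_enc, h)
        hidden = [math.tanh(a + b) for a, b in zip(proj_dec, proj_enc)]
        scores.append(dot(v, hidden))
    weights = softmax(scores)
    dim = len(enc_states[0])
    context = [sum(w * h[i] for w, h in zip(weights, enc_states))
               for i in range(dim)]
    return weights, context
```

In the full model the context vector is fed to the decoder GRU together with the previously generated word; that part is omitted here.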

Encoder Pre-Training
When training a NHG model, most approaches generally use a limited number of first sentences or tokens of the document. While this is efficient (as the input sequences are shorter) and effective (the most informative content tends to be at the beginning of the document), this leaves the rest of the sentences in the document unused. Better understanding of words and their context can be learned if all sentences are used, especially on small training sets.
To utilize the entire training set, we pre-train the encoder on all the sentences of the training set documents. Since the encoder consists of two recurrent components (a forward and a backward GRU), we pre-train them separately. First we add a softmax output layer to the forward GRU and train it on the sentences to predict the next word given the previous ones (i.e. we train it as a LM). After convergence on the validation set sentences, we take the embedding weights of the forward GRU and use them as fixed parameters for the backward GRU. Then we train the backward GRU following the same procedure as with the forward GRU, with the exception of processing the sentences in reverse order. When both models are fully trained, we remove the softmax output layers and initialize the encoder of the NHG model with the embeddings and GRU parameters of the trained LMs (highlighted with gray background in Figure 1).
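The difference between the forward and backward LM training data can be sketched as follows. The helper `lm_examples` and the boundary symbols `<s>` and `</s>` are hypothetical names chosen for illustration; the sketch only shows how input and target sequences are formed, not the GRU training itself.

```python
def lm_examples(sentence, reverse=False, bos="<s>", eos="</s>"):
    """Build (inputs, targets) for next-word prediction on one sentence.

    With reverse=True the words are processed in reverse order, which is
    the only change needed to obtain training data for the backward GRU.
    """
    tokens = sentence.split()
    if reverse:
        tokens = tokens[::-1]
    sequence = [bos] + tokens + [eos]
    # At each position the model reads inputs[i] and must predict targets[i].
    return sequence[:-1], sequence[1:]
```

For the sentence "the cat sat", the forward LM is trained to predict "the", "cat", "sat" and the end symbol in that order, while the backward LM predicts "sat", "cat", "the" and the end symbol.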

Decoder Pre-Training
Pre-training the decoder as a LM seems natural, since it is essentially a conditional LM. During NHG model training the decoder is fed only headline words, which is relatively little data compared to the document contents. To improve the quality of the headlines it is essential to have high quality embeddings that are a good semantic representation of the input words and to have a well trained recurrent and output layer to predict sensible words that make up coherent sentences. When it comes to statistical models, the simplest way to improve the quality of the parameters is to train the model on more data, but it also has to be the right kind of data (Moore and Lewis, 2010).
To increase the amount of suitable training data for the decoder we use LM pre-training on filtered sentences of the training set documents. For filtering we use the XenC tool by Rousseau (2013) with cross-entropy difference filtering (Moore and Lewis, 2010). In our case the in-domain data is the training set headlines, the out-of-domain data is the sentences from the training set documents, and the best cut-off point is evaluated on the validation set headlines. The careful selection of sentences is mostly motivated by preventing the pre-trained decoder from deviating too much from Headlinese, but it also reduces training time.
Before pre-training we initialize the input and output embeddings of the LM for words that are shared by the encoder and decoder vocabularies with the corresponding pre-trained encoder embeddings. We train the LM on the selected sentences until perplexity on the validation set headlines stops improving and then use it to initialize the decoder parameters of the NHG model (highlighted with dotted background in Figure 1).
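The idea behind cross-entropy difference filtering can be sketched in a few lines. This is a simplified illustration and not the XenC implementation: it assumes add-one smoothed unigram LMs instead of the n-gram LMs a real tool would use, and all function names are hypothetical. Sentences with the lowest score resemble headlines the most.

```python
import math
from collections import Counter

def unigram_lm(sentences, vocab):
    """Add-one smoothed unigram probabilities over a fixed vocabulary."""
    counts = Counter(w for s in sentences for w in s.split())
    total = sum(counts[w] for w in vocab) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}

def cross_entropy(sentence, lm):
    """Per-word cross-entropy (in nats) of a non-empty sentence."""
    words = sentence.split()
    return -sum(math.log(lm[w]) for w in words) / len(words)

def rank_by_cross_entropy_difference(headlines, doc_sentences):
    """Sort document sentences by H_in(s) - H_out(s), lowest first."""
    vocab = {w for s in headlines + doc_sentences for w in s.split()}
    in_lm = unigram_lm(headlines, vocab)       # in-domain: headlines
    out_lm = unigram_lm(doc_sentences, vocab)  # out-of-domain: documents
    scored = [(cross_entropy(s, in_lm) - cross_entropy(s, out_lm), s)
              for s in doc_sentences if s.split()]
    return sorted(scored)
```

Selecting the training sentences then amounts to keeping a prefix of the ranked list; the length of that prefix corresponds to the cut-off point, which is chosen on validation data and is not shown here.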
A similar approach, without data selection and embedding initialization, has also been used by Alifimoff (2015).
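The embedding initialization step of the decoder LM can be sketched as below. The representation is an assumption made for illustration: vocabularies are dictionaries mapping words to row indices, and embedding matrices are lists of rows. The function name `init_decoder_embeddings` is hypothetical.

```python
def init_decoder_embeddings(dec_vocab, enc_vocab, enc_embeddings, dec_embeddings):
    """Copy pre-trained encoder vectors for words present in both vocabularies.

    Rows of dec_embeddings for words unknown to the encoder keep their
    existing (e.g. random) initialization. Returns the number of copied rows.
    """
    copied = 0
    for word, dec_index in dec_vocab.items():
        enc_index = enc_vocab.get(word)
        if enc_index is not None:
            dec_embeddings[dec_index] = list(enc_embeddings[enc_index])
            copied += 1
    return copied
```

The same copy would be applied to both the input and the output embeddings of the LM.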

Distant Supervision Pre-Training
Approaches described in sections 2.1 and 2.2 enable full pre-training of the encoder and decoder, but this still leaves the connecting parameters (with white background in Figure 1) untrained.
As results in language modelling suggest, surrounding sentences contain useful information to predict words in the current sentence (Wang and Cho, 2016). This implies that other sentences contain informative sections that the attention mechanism can learn to attend to and general context that the initialization component can learn to extract.
To utilize this phenomenon, we propose using carefully picked sentences from the documents as pseudo-headlines and pre-training the NHG model to generate these given the rest of the sentences in the document. Our pseudo-headline picking strategy consists of choosing sentences that occur within the first 100 tokens of the document and were retained during cross-entropy filtering in section 2.2. Picking sentences from the beginning of the document should give us the most informative sentences, and cross-entropy filtering keeps sentences that most closely resemble headlines.
The pre-training procedure starts with initializing the encoder and decoder with LM pre-trained parameters (sections 2.1 and 2.2). After that, we continue training the attention and initialization parameters until perplexity on validation set headlines converges. We then use the trained parameters to initialize all parameters of the NHG model. Distant supervision has also been used for multi-document summarization by Bravo-Marquez and Manriquez (2012).
Table 1: ... (Klakow and Peters, 2002). All pre-trained models are significantly better than the No pre-training baseline.
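The construction of distant supervision training pairs can be sketched as follows. Two points are assumptions of this sketch rather than details stated above: a sentence counts as occurring within the first 100 tokens only if it ends within them, and the set `retained` holds the space-joined sentences kept by the filtering step. The function name is hypothetical.

```python
def pseudo_headline_pairs(sentences, retained, max_tokens=100):
    """Build (input tokens, pseudo-headline tokens) pairs for one document.

    sentences: list of token lists, in document order.
    retained:  set of space-joined sentences kept by cross-entropy filtering.
    """
    pairs = []
    offset = 0  # number of tokens before the current sentence
    for i, sentence in enumerate(sentences):
        end = offset + len(sentence)
        if end <= max_tokens and " ".join(sentence) in retained:
            # The model input is the document without the picked sentence.
            rest = [tok for j, other in enumerate(sentences) if j != i
                    for tok in other]
            pairs.append((rest, sentence))
        offset = end
    return pairs
```

A document can yield several pairs, one for every sentence that satisfies both conditions.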

Experiments
We evaluate the proposed pre-training methods in terms of ROUGE and perplexity on two relatively small datasets (English and Estonian).

Training Details
All our models use hidden layer sizes of 256 and the weights are initialized according to Glorot and Bengio (2010). The vocabularies consist of up to 50000 most frequent training set words that occur at least 3 times. The model is implemented in Theano (Bergstra et al., 2010; Bastien et al., 2012) and trained on GPUs using mini-batches of size 128. During training the weights are updated with Adam (Kingma and Ba, 2014) (parameters: $\alpha=0.001$, $\beta_1=0.9$, $\beta_2=0.999$, $\epsilon=10^{-8}$ and $\lambda=1-10^{-8}$) and the $L_2$-norm of the gradient is kept within a threshold of 5.0 (Pascanu et al., 2013). During headline generation we use beam search with beam size 5.
Table 2: Recall and precision of ROUGE-1 and ROUGE-L on the test sets. Best scores in bold. Results with statistically significant differences (95% confidence) compared to No pre-training underlined.
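Two of the training details, the vocabulary rule and the gradient norm threshold, can be illustrated with a short sketch. It is not the Theano code used in the experiments: the function names are hypothetical, and the gradient is assumed to be flattened into a single list so that one global norm is computed.

```python
import math
from collections import Counter

def build_vocab(tokenized_texts, max_size=50000, min_count=3):
    """Keep up to max_size most frequent words occurring at least min_count times."""
    counts = Counter(tok for text in tokenized_texts for tok in text)
    frequent = [(w, c) for w, c in counts.most_common() if c >= min_count]
    return [w for w, _ in frequent[:max_size]]

def clip_gradient(grad, threshold=5.0):
    """Rescale the gradient so that its L2-norm does not exceed the threshold."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= threshold:
        return list(grad)
    scale = threshold / norm
    return [g * scale for g in grad]
```

Rescaling keeps the direction of the gradient unchanged and only limits its length, which is the behaviour described by Pascanu et al. (2013).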

Datasets
We use the CNN/Daily Mail dataset (Hermann et al., 2015)¹ for experiments on English (EN). The number of headline-document pairs is 287227, 13368 and 11490 in the training, validation and test sets, respectively. The preprocessing consists of tokenization, lowercasing, replacing numeric characters with #, and removing irrelevant parts (editor notes, timestamps etc.) from the beginning of the document with heuristic rules.
For Estonian (ET) experiments we use a similarly sized (341607, 18979 and 18977 training, validation and test split) dataset that also consists of news from two sources. During preprocessing, compound words are split, words are truecased and numbers are written out as words. We used the Estnltk (Orasmaa et al., 2016) stemmer for ROUGE evaluations.
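The first three English preprocessing steps can be sketched as below. The regular expression based tokenizer is an assumption of this sketch, since the exact tokenizer is not specified here, and the heuristic removal of editor notes and timestamps is not reproduced.

```python
import re

def preprocess(text):
    """Lowercase, replace every digit with '#', and split into tokens."""
    text = text.lower()
    text = re.sub(r"\d", "#", text)
    # Words and masked numbers (including decimals such as '#.#') stay whole;
    # any other non-space character becomes a token of its own.
    return re.findall(r"[\w#]+(?:[.,]#+)*|[^\w\s]", text)
```

For example, the sentence "Shares rose 3.5% in 2016." is turned into the tokens shares, rose, #.#, %, in, #### and a final full stop.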

Results and Analysis
Models are evaluated in terms of perplexity (PPL) and full length ROUGE (Lin, 2004). In addition to pre-training methods described in sections 2.1-2.3, we also test: initializing only the embeddings using parameters from the LM pre-trained encoder and decoder (Embeddings); initializing the encoder and decoder, but leaving connecting parameters randomized (Enc.+dec.); pre-training the whole model from random initialization with distant supervision only (Distant all); and a baseline that is not pre-trained at all (No pre-training).
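For reference, perplexity and a relative change in perplexity between two models can be computed as in the following sketch. The function names are hypothetical, and the numbers in the usage note are made-up values that are not taken from our tables.

```python
import math

def perplexity(token_log_probs):
    """Perplexity from natural-log probabilities of the reference words."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

def relative_improvement(baseline_ppl, new_ppl):
    """Reduction in perplexity as a percentage of the baseline perplexity."""
    return 100.0 * (baseline_ppl - new_ppl) / baseline_ppl
```

As an illustration with made-up values, a model that lowers perplexity from 100 to 70 achieves a 30% relative improvement.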
All pre-training methods gave significant improvements in PPL (Table 1). The best method (Enc.+dec.+dist.) improved the test set PPL by 29.6-32.4% relative. Pre-trained NHG models also converged faster during training (Figure 2) and most of them beat the final PPL of the baseline already after the first epoch. The general trend is that pre-training a larger number of parameters, and parameters closer to the outputs of the NHG model, improves the PPL more. Distant all is an exception to that observation, as it used much less training data (the same as the baseline) than the other methods.
¹ http://cs.nyu.edu/~kcho/DMQA/
For ROUGE evaluations, we report ROUGE-1 and ROUGE-L (Table 2). In contrast with PPL evaluations, some pre-training methods either don't improve significantly or even worsen ROUGE measures. Another difference compared to PPL evaluations is that for ROUGE, pre-training parameters that reside further from the outputs (embeddings and encoder) seems more beneficial. This might imply that a better document representation is more important for staying on topic during beam search, while it is less important during PPL evaluation, where predicting the next target headline word with high confidence is rewarded and the process is aided by previous target headline words that are fed to the decoder as inputs. It is also possible that a well trained decoder becomes too reliant on expecting correct words as inputs, making it sensitive to errors during generation, which would somewhat explain why Enc.+dec. performs worse than Encoder alone. This hypothesis can be checked in further work by experimenting with methods like scheduled sampling.
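The two reported measures can be sketched as follows. This is a simplified illustration and not the official ROUGE toolkit: it works on already tokenized text, applies no stemming, and the function names are hypothetical. ROUGE-1 is based on clipped unigram overlap and ROUGE-L on the longest common subsequence (LCS).

```python
from collections import Counter

def rouge_1(candidate, reference):
    """Return (recall, precision) of the clipped unigram overlap."""
    overlap = sum((Counter(candidate) & Counter(reference)).values())
    return overlap / len(reference), overlap / len(candidate)

def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def rouge_l(candidate, reference):
    """Return (recall, precision) based on the LCS length."""
    lcs = lcs_length(candidate, reference)
    return lcs / len(reference), lcs / len(candidate)
```

Because the LCS respects word order, a generated headline that contains the right words in the wrong order scores lower on ROUGE-L than on ROUGE-1.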

Conclusions
We proposed three new NHG model pre-training methods that in combination enable utilizing the entire dataset and initializing all parameters of the NHG model. We also evaluated and analyzed pre-training methods and their combinations in terms of perplexity (PPL) and ROUGE. The results revealed that better PPL doesn't necessarily translate to better ROUGE: PPL tends to benefit from pre-training parameters that are closer to the outputs, but for ROUGE it is generally the opposite. Also, PPL benefited from pre-training more parameters, while for ROUGE this was not always the case. Pre-training in general proved to be useful: our best results improved PPL by 29.6-32.4% relative and ROUGE measures by 0.85-2.84 points compared to a NHG model without pre-training.
The current work focused on maximally utilizing available headlined corpora. One interesting future direction would be to additionally utilize potentially much more abundant corpora of documents without headlines (also proposed by Shen et al. (2016)) for pre-training. Another open question is the relationship between the dataset size and the effect of pre-training.