Diverse, Controllable, and Keyphrase-Aware: A Corpus and Method for News Multi-Headline Generation

News headline generation aims to produce a short sentence to attract readers to read the news. One news article often contains multiple keyphrases that are of interest to different users, so it can naturally have multiple reasonable headlines. However, most existing methods focus on single-headline generation. In this paper, we propose generating multiple headlines with keyphrases of user interests, whose main idea is to first generate multiple keyphrases of interest to users for the news, and then generate multiple keyphrase-relevant headlines. We propose a multi-source Transformer decoder, which takes three sources as inputs: (a) keyphrase, (b) keyphrase-filtered article, and (c) original article to generate keyphrase-relevant, high-quality, and diverse headlines. Furthermore, we propose a simple and effective method to mine the keyphrases of interest in the news article and build the first large-scale keyphrase-aware news headline corpus, which contains over 180K aligned triples of $<$news article, headline, keyphrase$>$. Extensive experimental comparisons on the real-world dataset show that the proposed method achieves state-of-the-art results in terms of quality and diversity.


Introduction
News headline generation is an under-explored subtask of text summarization (See et al., 2017; Gehrmann et al., 2018; Zhong et al., 2019). Unlike text summaries that contain multiple context-related sentences to cover the main ideas of a document, news headlines often contain a single short sentence to encourage users to read the news. Since one news article typically contains multiple keyphrases or topics of interest to different users, it is useful to generate multiple headlines covering different keyphrases for the news article. Multi-headline generation aims to generate multiple independent headlines, which allows us to recommend news with different news headlines based on the interests of users. Besides, multi-headline generation can provide multiple hints for human news editors to assist them in writing news titles.
However, most existing methods (Takase et al., 2016; Ayana et al., 2016; Murao et al., 2019; Colmenares et al., 2019; Zhang et al., 2018) focus on single-headline generation. The headline generation process is treated as a one-to-one mapping (the input is an article, and the output is a headline), and the models are trained and tested without any additional guiding information or constraints. We argue that this may lead to two problems. First, since it is reasonable to generate multiple headlines for the news, training the model to generate a single ground-truth headline might result in a lack of more detailed guidance. Even worse, a single ground truth without any constraint or guidance is often not enough to measure the quality of the generated headline at test time. For example, even if a generated headline is considered reasonable by humans, it can still get a low ROUGE score (Lin, 2004), because it might focus on keyphrases or aspects that are not consistent with the ground truth.
In this paper, we incorporate keyphrase information into headline generation as further guidance. Unlike the one-to-one mapping in previous works, we treat the headline generation process as a two-to-one mapping, where the inputs are a news article and a keyphrase, and the output is a headline. We propose a keyphrase-aware news multi-headline generation method, which contains two modules: (a) a Keyphrase Generation Model, which generates multiple keyphrases of interest to users for the news article, and (b) a Keyphrase-Aware Multi-Headline Generation Model, which takes the news article and a keyphrase as input and generates a keyphrase-relevant news headline. To train the models, we build the first large-scale keyphrase-aware news headline corpus, which contains over 180K aligned triples of <news article, headline, keyphrase>.
The proposed approach faces two major challenges. The first one is how to build the keyphrase-aware news headline corpus. To the best of our knowledge, no existing corpus contains news article and headline pairs that are aligned with a keyphrase of interest to users. The second is how to design the keyphrase-aware news headline generation model to ensure that the generated headlines are keyphrase-relevant, high-quality, and diverse. For the first challenge, we propose a simple but efficient method to mine the keyphrases of interest to users in news articles based on the user search queries and news click information collected from a real-world search engine. With this method, we build the keyphrase-aware news headline corpus. For the second challenge, we design a multi-source Transformer (Vaswani et al., 2017) decoder to improve the generation quality and the keyphrase sensitivity of the model, which takes three sources as inputs: (a) keyphrase, (b) keyphrase-filtered article, and (c) original article. For the proposed multi-source Transformer decoder, we further design and compare several variants of the attention-based fusing mechanism.
Extensive experiments on a real-world dataset show that the proposed method can generate high-quality, keyphrase-relevant, and diverse news headlines.

Keyphrase-Aware News Headline Corpus
Our keyphrase-aware news headline corpus, called KeyAware News, is built by the following steps:
(1) Data Collection. We collect 16,000,000 raw data pairs which contain news articles with user search query information from the Microsoft Bing News search engine. Each data pair can be represented as a tuple <Q, X, Y, C>, where Q is a user search query, X is a news article that the search engine returns to the user based on the search query Q, Y is a human-written headline for X, and C represents the number of times users click on the news under the search query Q. Each news article X has 10 different queries Q on average.
(2) Keyphrase Mining. We mine the keyphrase of interest to users from user search queries. We assume that if many users find and click on one news article through different queries containing the same phrase, such a phrase is the keyphrase for the article. For each article, we collect its corresponding user search queries and remove the stop words and special symbols from the queries. Then we find the common phrases (4-gram, 3-gram, or 2-gram) in these queries. These common phrases are scored based on how many times they appear in these queries and normalized by length. The score is also weighted by the user click number C, which means that phrases appearing in queries through which more users click on the article are more important. Finally, we use the n-gram with the highest score as the keyphrase Z of the article X.
(3) Article-Headline-Keyphrase Alignment. To obtain the aligned article-headline-keyphrase triples <X, Y, Z>, we filter out the data pairs whose article or headline does not contain Z. Moreover, we remove pairs whose article is longer than 600 tokens or shorter than 100 tokens, or whose headline is longer than 20 tokens or shorter than 3 tokens. After the alignment and data cleaning, we obtain KeyAware News, which contains about 180K aligned article-headline-keyphrase triples. We split it into Train, Test, and Dev sets, containing 165,913, 10,000, and 5,000 triples, respectively.
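The mining and alignment steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the stop-word list, the whitespace tokenization, and in particular the scoring rule (click-weighted query count multiplied by the n-gram length) are assumptions, since the text only states that scores depend on frequency, length, and clicks. The function names `clean_query`, `mine_keyphrase`, and `keep_triple` are hypothetical.

```python
import re
from collections import defaultdict

# Illustrative stop-word list; the source does not say which list was used.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}


def clean_query(query):
    """Lowercase, replace special symbols with spaces, and drop stop words."""
    tokens = re.sub(r"[^a-z0-9\s]", " ", query.lower()).split()
    return [t for t in tokens if t not in STOP_WORDS]


def mine_keyphrase(queries_with_clicks, sizes=(2, 3, 4)):
    """Return the highest-scoring n-gram over all queries of one article."""
    scores = defaultdict(float)
    for query, clicks in queries_with_clicks:
        tokens = clean_query(query)
        for n in sizes:
            # Count each n-gram at most once per query.
            grams = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
            for gram in grams:
                # Assumed scoring: click-weighted query count times n-gram length.
                scores[gram] += clicks * n
    if not scores:
        return None
    return " ".join(max(scores, key=scores.get))


def keep_triple(article, headline, keyphrase):
    """Alignment and length filters from step (3), using whitespace tokens."""
    n_article, n_headline = len(article.split()), len(headline.split())
    return (keyphrase in article and keyphrase in headline
            and 100 <= n_article <= 600 and 3 <= n_headline <= 20)
```

Calling `mine_keyphrase` with a list of `(query, clicks)` pairs for one article returns a single phrase Z, and `keep_triple` decides whether the resulting <X, Y, Z> triple survives cleaning. The sketch scores every n-gram rather than only those shared by several queries; a stricter variant could require a minimum number of supporting queries.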

Overview
The overall keyphrase-aware multi-headline generation procedure is shown in Figure 1 and involves two modules: (a) a keyphrase generation model, which generates multiple keyphrases of interest to users for the news article, and (b) a keyphrase-aware headline generation model, which takes the news article and each generated keyphrase as input and generates multiple keyphrase-relevant news headlines.
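The two-module procedure can be read as a simple pipeline. The sketch below only shows the control flow; `generate_keyphrases` and `generate_headline` are hypothetical callables standing in for the two trained models, and their signatures are assumptions made for illustration.

```python
def generate_multi_headlines(article, generate_keyphrases, generate_headline, k=3):
    """Two-stage pipeline: keyphrases first, then one headline per keyphrase.

    `generate_keyphrases(article, k)` and `generate_headline(article, keyphrase)`
    are hypothetical stand-ins for the two trained modules.
    """
    keyphrases = generate_keyphrases(article, k)
    # Each keyphrase conditions a separate decoding run over the same article.
    return [(z, generate_headline(article, z)) for z in keyphrases]
```

Because each headline is decoded independently given its keyphrase, the number of headlines is controlled by the number of keyphrases K rather than by the beam size.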

Headline Generation
Headline generation can be formalized as a sequence-to-sequence learning (Sutskever et al., 2014) task. Given an input news article X and a specific keyphrase Z, we aim to produce a keyphrase-relevant headline Y.

Headline Generation BASE Model
We first introduce the basic version of our headline generation model (which we call BASE), which is keyphrase-agnostic. BASE is built upon the Transformer Seq2Seq model (Vaswani et al., 2017), which has made remarkable progress in sequence-to-sequence learning. The Transformer contains a multi-head self-attention encoder and a multi-head self-attention decoder. As discussed by Vaswani et al. (2017), an attention function maps a query and a set of key-value pairs to an output as:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$
where the queries $Q$, keys $K$, and values $V$ are vectors packed row-wise into matrices, and $d_k$ is the dimension of the key vector. The multi-head attention mechanism further projects queries, keys, and values to $h$ different representation subspaces and calculates the corresponding attention as:
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)W^{O}, \quad \mathrm{head}_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V})$$
where $W_i^{Q}$, $W_i^{K}$, $W_i^{V}$, and $W^{O}$ are learned projection matrices.
The encoder is composed of a stack of N identical blocks. Each block has two sub-layers: a multi-head self-attention mechanism and a position-wise fully connected feed-forward network. All sub-layers are interconnected with residual connections (He et al., 2016) and layer normalization (Ba et al., 2016). Similarly, the decoder is also composed of a stack of N identical blocks. In addition to the two sub-layers in each encoder block, the decoder contains a third sub-layer which performs multi-head attention over the output of the encoder. Figure 2 (a) shows the architecture of the block in the decoder. BASE uses the pre-trained BERT-base model (Devlin et al., 2018) to initialize the parameters of the encoder. It also uses a Transformer decoder with a copy mechanism (Gu et al., 2016), whose hidden size, number of attention heads h, and number of blocks N are the same as those of its encoder.

Keyphrase-Aware Headline Generation Model
In order to explore more effective ways of incorporating keyphrase information into BASE, we design 5 variants of multi-source Transformer decoders.
Article + Keyphrase. The basic idea is to add the keyphrase into the decoder directly. The keyphrase $X_{key}$ is represented as a sequence of word embeddings. As shown in Figure 2 (b), we add an extra sub-layer that performs multi-head attention over $X_{key}$ in each block of the decoder; we write $X^{(n)}_{dec}$ for the output of the n-th block in the decoder. Since the original article already contains sufficient information for the model to learn to generate the headline, the model may tend to rely mainly on the article information and ignore the keyphrase, and thus become less sensitive to keyphrases. As a result, the generated headlines may lack diversity and keyphrase relevance.
Keyphrase-Filtered Article. Intuitively, when people read a news article, they tend to focus on the parts of the article that match the keyphrases of their interests. Inspired by this observation, before inputting the original article representation into the decoder, we use the attention mechanism to filter the article with the keyphrase (see Figure 2 (c)). Here $X_{enc}$ is the output of the last block in the encoder, and the resulting representation $\hat{X}_{enc}$ can be seen as the keyphrase-filtered article, which mainly keeps the article information that is related to the keyphrase. Since the decoder cannot directly access the representation of the original article, the model is forced to utilize the information of the keyphrase. Therefore, the sensitivity of the model to the keyphrase is improved.
Fusing Keyphrase-Filtered Article and Original Article. Although feeding the keyphrase-filtered article representation $\hat{X}_{enc}$ instead of the original article representation $X_{enc}$ to the decoder can improve the sensitivity of the model to the keyphrase, some useful global information in the original article may also be filtered out, which might reduce the quality of the generated headlines. To better balance the keyphrase sensitivity and headline quality of the model, we use $X_{enc}$ and $\hat{X}_{enc}$ as two input sources for the decoder and fuse them. As shown in Figure 2 (d)-(f), we design three decoder variants based on different fusing mechanisms to fuse $X_{enc}$ and $\hat{X}_{enc}$.
(a) Addition-Fusing Mechanism. We directly perform a point-wise addition between $X_{enc}$ and $\hat{X}_{enc}$. Then we feed the result into the decoder.
(b) Stack-Fusing Mechanism. We perform multi-head attention on $\hat{X}_{enc}$ and $X_{enc}$ one by one in each block of the decoder. All of the sub-layers are interconnected with residual connections.
(c) Parallel-Fusing Mechanism. For each block of the decoder, we perform multi-head attention in parallel on $\hat{X}_{enc}$ and $X_{enc}$. Then, we perform a point-wise addition between the two results. Similarly, all of the sub-layers are interconnected with residual connections.
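To make the difference between the three fusing mechanisms concrete, the sketch below implements them with a single attention head in plain Python. It is a simplified illustration under several assumptions: learned projections, layer normalization, the feed-forward sub-layer, and the decoder self-attention are all omitted, and the keyphrase-filtered article is assumed to be computed with the article as queries and the keyphrase as keys and values (which gives it the same shape as the original article representation, as the point-wise addition requires). All function names are hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    d_k = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def add(a, b):
    """Point-wise addition of two equally shaped sequences of vectors."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def filter_article(x_enc, x_key):
    # Assumption: article positions query the keyphrase embeddings.
    return attention(x_enc, x_key, x_key)


def addition_fuse(dec, x_enc, x_hat):
    memory = add(x_enc, x_hat)            # fuse once, before the decoder
    return add(dec, attention(dec, memory, memory))


def stack_fuse(dec, x_enc, x_hat):
    h = add(dec, attention(dec, x_hat, x_hat))   # filtered article first
    return add(h, attention(h, x_enc, x_enc))    # then the original article


def parallel_fuse(dec, x_enc, x_hat):
    a = attention(dec, x_hat, x_hat)
    b = attention(dec, x_enc, x_enc)
    return add(dec, add(a, b))            # both sources attended in parallel
```

The design difference is where the two sources meet: addition-fusing merges them once outside the decoder, stack-fusing lets the second attention see the result of the first, and parallel-fusing attends to both from the same decoder state before summing.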

Keyphrase Generation
In this subsection, we show how to generate the keyphrases for a given news article X. Here we briefly describe three methods for keyphrase generation. It should be noted that in this paper, we mainly focus on news headline generation rather than keyphrase generation.
(1) TF-IDF Ranking. We use Term Frequency Inverse Document Frequency (TF-IDF) (Zhang et al., 2007) to weight all n-grams (n = 2, 3, and 4) in the news article X. Then we filter out n-grams with TF-IDF below the threshold or containing any punctuation or special character. For different n of the n-gram, we set different thresholds for filtering. We take this unsupervised method as a baseline.
(2) Seq2Seq. Since our KeyAware News corpus contains article-keyphrase pairs, we treat keyphrase generation as a sequence-to-sequence learning task. We train the model BASE with article-keyphrase pairs. During inference, we use beam search with a length penalty to generate n-grams (n = 2, 3, and 4) as the keyphrases.
(3) Slot Tagging. Because the keyphrases also appear in the news articles, we can formulate the keyphrase generation task as a slot tagging task (Williams, 2019). We fine-tune the BERT-base model for this purpose. We use the output sequence of the model to predict the beginning and end positions of the keyphrase in the article. During inference, we follow the answer span prediction method used in (Seo et al., 2017) to predict the n-grams (n = 2, 3, and 4) with the highest probabilities as the keyphrases.
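For the slot tagging method, inference reduces to ranking candidate spans. The sketch below assumes the tagger has already produced per-token start and end probabilities and scores each span of 2 to 4 tokens by the product of its start and end probabilities, which is a common form of answer span prediction; whether the authors use exactly this scoring is an assumption, and `top_k_spans` is a hypothetical name.

```python
def top_k_spans(tokens, start_probs, end_probs, k=5, sizes=(2, 3, 4)):
    """Rank n-gram spans by start probability times end probability."""
    candidates = []
    for i in range(len(tokens)):
        for n in sizes:
            j = i + n - 1  # inclusive end index of the span
            if j < len(tokens):
                score = start_probs[i] * end_probs[j]
                candidates.append((score, " ".join(tokens[i:j + 1])))
    # Stable sort: spans with equal scores keep their left-to-right order.
    candidates.sort(key=lambda c: c[0], reverse=True)
    return [phrase for _, phrase in candidates[:k]]
```

Restricting the span length to the same n-gram sizes used when mining the corpus keeps the predicted keyphrases comparable with the golden ones.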

Keyphrase Generation
In the first experiment, we evaluate the performance of three keyphrase generation methods: (a) unsupervised TF-IDF Ranking, (b) supervised sequence-to-sequence model (SEQ2SEQ), and (c) supervised slot tagging model (SLOT). We use the top-K exact-match rate (EM@K) as an evaluation metric, which tests whether one of the K generated keyphrases matches the golden keyphrase exactly. Some of the generated keyphrases may not exactly match the golden keyphrase but have overlapping tokens with it (a generated keyphrase may be a sub-sequence of the golden keyphrase or vice versa). We thus also report Recall@K (R@K), which measures the percentage of the tokens in the golden keyphrase that are covered by the K generated keyphrases.
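The two metrics can be computed per article as in the sketch below and then averaged over the test set. Whitespace tokenization and the function names are assumptions made for illustration.

```python
def exact_match_at_k(predictions, gold, k):
    """1.0 if any of the top-k predicted keyphrases equals the golden one."""
    return float(gold in predictions[:k])


def recall_at_k(predictions, gold, k):
    """Fraction of golden keyphrase tokens covered by the top-k predictions."""
    gold_tokens = gold.split()
    covered = {t for p in predictions[:k] for t in p.split()}
    return sum(t in covered for t in gold_tokens) / len(gold_tokens)
```

For example, with the golden keyphrase "mountain east conference" and a top prediction of "east conference", EM@1 is 0 while R@1 is 2/3, which is the partial-credit case that motivates reporting R@K.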
Results. The results are shown in Table 1. We can see that the EM@1 of TF-IDF is only 18.63%, whereas SLOT achieves 60.75%. Both SEQ2SEQ and SLOT significantly outperform TF-IDF in all metrics. SEQ2SEQ achieves performance comparable to SLOT in EM@K, but performs worse than SLOT in R@K. SLOT achieves 83.18% EM@5 and 89.08% R@5. In the following experiments, we use SLOT to generate keyphrases for our keyphrase-aware news headline generation models.

News Headline Generation
Baselines. In the following experiments, we compare the variants of the proposed keyphrase-aware models introduced in Section 3.2.1 as follows:
(1) BASE, as shown in Figure 2 (a), which is keyphrase-agnostic and only takes the news article as input.
(2) BASE + KEY, as shown in Figure 2 (b), which takes the keyphrase and the article as input.
(3) BASE + Filter, as shown in Figure 2 (c), which takes the keyphrase-filtered article as input.
(4) BASE + StackFuse, (5) BASE + AddFuse, and (6) BASE + ParallelFuse, as shown in Figure 2 (d-f), which take the keyphrase-filtered article and the original article as inputs with the stack-fusing, addition-fusing, and parallel-fusing mechanisms, respectively.
Based on BASE + StackFuse, BASE + AddFuse, and BASE + ParallelFuse, we further use the keyphrase as an additional input, as in BASE + KEY. This gives three additional variants: (7) BASE + StackFuse + KEY, (8) BASE + AddFuse + KEY, and (9) BASE + ParallelFuse + KEY.
In addition to BASE, we also compare four other keyphrase-agnostic baselines as follows.
(10) PT-NET, the original pointer-generator network (See et al., 2017), which is widely used in text summarization and headline generation tasks.
(11) SEASS (Zhou et al., 2017b), the GRU-based (Cho et al., 2014) sequence-to-sequence model with a selective encoding mechanism, which is widely used in text summarization.
(12) Transformer + Copy (Vaswani et al., 2017; Gu et al., 2016), which has the same architecture hyperparameters as BASE; the only difference is that it does not use BERT to initialize the encoder.
(13) BASE + Diverse, which applies diverse decoding (Li et al., 2016b) in beam search to BASE during inference to improve the generation diversity for multi-headline generation.
To sum up, there are a total of 13 models to compare.
Implementation and Hyperparameters. The encoder and the decoder of BASE have the same architecture hyperparameters as BERT-base. All the variants of keyphrase-aware headline generation models also have the same architecture hyperparameters as BASE. The only difference among them is the structure of their decoder blocks, as shown in Figure 2. We follow the same training strategy as (Vaswani et al., 2017) for model training. The implementations of PT-NET and SEASS are based on their open-source code.

Multi-Headline Generation
In this experiment, we only give the models the news articles, without the golden keyphrases. We use SLOT to generate the top-K keyphrases for each article. Each keyphrase-aware generation model then uses them to generate K different keyphrase-relevant headlines. For keyphrase-agnostic baselines, we apply beam search to generate the top-K headlines for each article. We also apply diverse decoding to BASE as a strong baseline (BASE + Diverse) for further comparison. The diversity penalty is set to 1.0. It should be noted that diverse decoding can also be applied to our keyphrase-aware models to further improve diversity.
Metrics. Following (Li et al., 2016a), we use Distinct-1 and Distinct-2 (the higher, the better) to evaluate diversity; they report the degree of diversity by calculating the number of distinct unigrams and bigrams in the generated headlines for each article. Since randomly generated headlines are also highly diverse, we measure quality as well. As discussed in Section 1, one news article can have multiple reasonable headlines. However, each article in our test set has only one human-written headline, which may focus on only one keyphrase of the news article. We should emphasize that there may be only one generated headline that focuses on the same keyphrase as the human-written headline, while the others focus on distinct keyphrases. It is thus not reasonable to use the same human-written headline as the ground truth to evaluate all generated headlines. We assume that if the headlines generated by the model are high-quality and diverse, there is a higher probability that one of the headlines is close to the single ground truth. Therefore, we report the highest ROUGE score among the multiple generated headlines for each article. This criterion is similar to top-K errors (He et al., 2016) in image classification tasks. We report the results for different K (K=1, 3, and 5).
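The sketch below illustrates both criteria for a single article. Two points are assumptions rather than details given in the text: Distinct-n is normalized by the total number of generated tokens, as we understand the definition in (Li et al., 2016a), and a simplified unigram-overlap F1 is used as a stand-in for ROUGE, whose official implementation includes further variants and preprocessing. The function names are hypothetical.

```python
from collections import Counter


def distinct_n(headlines, n):
    """Distinct n-grams across the headlines, scaled by total token count."""
    grams, total_tokens = set(), 0
    for headline in headlines:
        tokens = headline.split()
        total_tokens += len(tokens)
        grams.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(grams) / total_tokens if total_tokens else 0.0


def unigram_f1(candidate, reference):
    """Simplified ROUGE-1-style F1 based on clipped unigram overlap."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def best_of_k(headlines, reference, k):
    """Highest score among the first k generated headlines."""
    return max(unigram_f1(h, reference) for h in headlines[:k])
```

Taking the maximum over K headlines rewards a model when at least one headline addresses the same keyphrase as the human-written one, without penalizing the remaining headlines for covering other keyphrases.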
Results. Table 2 presents the results. For diversity, we can see that all of our keyphrase-aware generation models significantly outperform the keyphrase-agnostic baselines in both Distinct-1 and Distinct-2 for all K. With diverse decoding, BASE + Diverse achieves higher diversity than BASE. Nevertheless, its diversity is still lower than that of most keyphrase-aware generation models. As expected, BASE + Filter achieves the highest diversity and BASE + KEY achieves the lowest diversity among the variants of our keyphrase-aware generation models. For quality, there is a big gap between the keyphrase-agnostic baselines other than BASE and our keyphrase-aware generation models. Except for BASE + Filter, all of our keyphrase-aware generation models achieve higher ROUGE scores than BASE and BASE + Diverse (see the last 6 rows in Table 2). These results show that our keyphrase-aware generation models can effectively generate high-quality and diverse headlines.

News Article Retrieval
To further evaluate the quality and diversity of the generation, we design an experiment that uses a search engine to help us verify the diversity and quality of the generated headlines. It should be noted that the main purpose of this experiment is not to improve the performance of the search engine but to measure the quality and diversity of the generated multiple headlines through a real-world search engine. We first collect the data pairs of the news article and its related user search query < X, Q > in the following way. If the article X is returned by the search engine based on a user query Q and the user clicks on the article X, then we take the query and the article as a data pair < X, Q >. After collection, each article X in the test set has 10 different related user queries on average. The article X is used as the ground-truth for Q in the following evaluation. We replace the search key in the original search engine for each article with the K generated multi-headlines. Also, we re-build the indexes of the search engine that contains 10,000 news articles in the test set. Then we re-use the user search queries to retrieve the article. We believe that if the generated multi-headlines have high diversity and quality, then given different user queries, there should be a high probability that the golden article can be retrieved.
Metrics. We use mean average precision (mAP), a metric widely used in information retrieval. We report the results of mAP@N (N=1, 3, 5, and 10), which test the average probability that the golden article is ranked by the search engine in the top N. Using the human-written headline (HUMAN) as the search key is evaluated as a strong baseline. We also compare the performance of using different numbers of headlines (K=1, 3, and 5). It should be noted that increasing the number of headlines used as the search key does not guarantee an improvement in mAP, because the number of search keys of all other articles also increases. If the generated multi-headlines are not good enough, they introduce noise and may even cause mAP to decrease.
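Since each query has exactly one golden article, the retrieval metric can be sketched in a few lines. The text does not spell out the exact formula, so two readings are shown, both as assumptions: the standard average precision, which with a single relevant item reduces to the reciprocal rank within the top N, and a plain top-N hit rate, which matches the description "ranked in the top N". The function names are hypothetical.

```python
def average_precision_at_n(ranked_ids, gold_id, n):
    """With one relevant article, AP@N is 1/rank if it appears in the top N."""
    for rank, doc_id in enumerate(ranked_ids[:n], start=1):
        if doc_id == gold_id:
            return 1.0 / rank
    return 0.0


def hit_at_n(ranked_ids, gold_id, n):
    """1.0 if the golden article appears anywhere in the top N."""
    return float(gold_id in ranked_ids[:n])


def mean_over_queries(metric, runs, n):
    """Average a per-query metric over (ranked_ids, gold_id) pairs."""
    return sum(metric(ranked, gold, n) for ranked, gold in runs) / len(runs)
```

The two readings coincide at N=1 and diverge for larger N, where the reciprocal-rank form additionally rewards placing the golden article higher within the top N.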
Results. Table 3 presents the results. Similarly, our models perform much better than the keyphrase-agnostic baselines. BASE + Diverse outperforms BASE, but still performs worse than 7 keyphrase-aware generation models (see the last 7 rows in Table 3). Generally, as the number of headlines K increases, the performance of our keyphrase-aware generation models improves much more than that of the other baselines. We find that BASE + ParallelFuse (K=5) achieves an mAP@10 of 76.08, which is even better than HUMAN. These results demonstrate that our keyphrase-aware generation models can generate high-quality and diverse headlines.

Human Evaluation
At a similar level of diversity, we want to know the quality of the headlines generated by our keyphrase-aware headline generation model compared to BASE + Diverse. Since the Distinct-1 and Distinct-2 scores of BASE + Diverse and BASE + StackFuse are close, we compare their quality through human evaluation. We randomly sample 100 articles from the test set and let each model generate 3 different headlines. We also mix in a random headline and the golden headline for each article, and thus each article has 8 headlines. Three experts are asked to judge whether each headline could be used as the headline of the news. If more than two experts believe that it can be used as the headline of the news, then this headline is considered qualified. The qualified rates of golden, BASE + StackFuse, BASE + Diverse, and random are 91.8%, 62.6%, 36.2%, and 0.0%, respectively. These results show that the quality of BASE + StackFuse is also higher than that of BASE + Diverse. We present some examples for comparison, as shown in Figure 3.

Related Works
News headline generation is a subtask of summarization which has been extensively studied recently (Rush et al., 2015; Takase et al., 2016; Ayana et al., 2016; Tan et al., 2017; Zhou et al., 2017b; Higurashi et al., 2018; Zhang et al., 2018; Murao et al., 2019). Rush et al. (2015) propose an attention-based neural network for headline generation. Takase et al. (2016) propose a method based on the encoder-decoder architecture and design an AMR encoder for headline generation. To take evaluation metrics into consideration, Ayana et al. (2016) apply the minimum risk training method to the generation model. Tan et al. (2017) propose a coarse-to-fine method, which first extracts the salient sentences and then generates the headline based on these sentences. Zhang et al. (2018) propose a method for question headline generation, which designs a dual-attention seq2seq model. However, most previous headline generation methods focus on one-to-one mapping, and the headline generation process is not controllable. In this work, we focus on the news multi-headline generation problem and design a keyphrase-aware headline generation method.
Various information-aware methods have been successfully used in natural language generation tasks (Zhou et al., 2017a; Zhou et al., 2018; Wang et al., 2017), such as response generation in dialogue systems. Zhou et al. (2017a) propose a mechanism-aware seq2seq model for controllable response generation. Zhou et al. (2018) propose a commonsense knowledge aware conversation generation method. Wang et al. (2017) propose an encoder-decoder based neural network for response generation. To the best of our knowledge, we are the first to consider a keyphrase-aware mechanism for news headline generation and to build the first keyphrase-aware news headline corpus.
[Figure 3: Example news articles used for comparison: Article#1 on the Mountain East Conference announcing new members, and an article on one of the FBI's 10 most wanted being shot and killed in Apex.]

Conclusion
In this paper, we demonstrate how to enable news headline generation systems to be aware of keyphrases such that the model can generate diverse news headlines in a controlled manner. We also build the first large-scale keyphrase-aware news headline corpus, which is based on mining the keyphrases of users' interests in news articles with user queries. Moreover, we propose a keyphrase-aware news multi-headline generation model that contains a multi-source Transformer decoder with three variants of attention-based fusing mechanisms. Extensive experiments on the real-world dataset show that our approach can generate high-quality, keyphrase-relevant, and diverse news headlines, and that it outperforms many strong baselines.