A Large-Scale Multi-Length Headline Corpus for Analyzing Length-Constrained Headline Generation Model Evaluation

Browsing news articles on multiple devices is now possible. The lengths of news article headlines have precise upper bounds, dictated by the size of the display of the relevant device or interface. Therefore, controlling the length of headlines is essential when applying the task of headline generation to news production. However, because there is no corpus of headlines of multiple lengths for a given article, previous research on controlling output length in headline generation has not discussed whether the system outputs could be adequately evaluated without multiple references of different lengths. In this paper, we introduce two corpora, the Japanese News Corpus (JNC) and the JApanese MUlti-Length Headline Corpus (JAMUL), to confirm the validity of previous evaluation settings. The JNC provides common supervision data for headline generation. The JAMUL is a large-scale evaluation dataset for headlines of three different lengths composed by professional editors. We report new findings on these corpora; for example, although the longest reference summary can appropriately evaluate the existing methods for controlling output length, this evaluation setting has several problems.


Introduction
The news media publish newspapers in print and in electronic forms. In electronic form, articles may be read on various types of devices using any application; thus, news media companies have an increasing need to produce multiple headlines for the same news article on the basis of what would be most appropriate and most compelling on an array of devices. Every device and application used for viewing articles has a strict upper bound regarding the number of characters allowed because of limitations in the space where the headline appears. The technology of automatic headline generation has the potential to contribute greatly to this domain, and the problems of news headline generation have motivated a wide range of research (Wang et al., 2018; Chen et al., 2018; Li et al., 2018; Song et al., 2018; Kiyono et al., 2018). Table 1 shows sample headlines in three different lengths written by professional editors of a media company for the same news article: the first headline for digital media is restricted to 10 characters, the second to 13 characters, and the third to 26 characters.

* This work was done at Retrieva within Project.
arXiv:1903.11771v2 [cs.CL] 29 Mar 2019

Article (excerpt): On 18th Toyota announced that it will set the model of only engine cars to zero by about 2025. ... They set "electric vehicle" which is Hybrid Vehicle (HV), Plug-in Hybrid Vehicle (PHV), and Fuel Cell Vehicle (FCV) to all models. ...
Headline: トヨタ、 エンジン車だけの車種ゼロへ 2025年ごろ
Toyota sets the number of models with only engine cars to zero by about 2025.

Table 1: An example of four headlines for the same article that were created by professional editors. In this example, '電動車' (electric cars) and '全' (all) are shown in red and are not included in the 24-character headline; these tokens cannot be evaluated with the 24-character headline. The blue tokens are not included in the 9- and 13-character headlines; these tokens should not be included in shorter headlines.

From a practical perspective, headlines must be generated under a rigid length constraint. However, few studies have been performed based on this assumption. The first study to consider the length of system outputs in the context of encoder-decoder language generation was Rush et al. (2015). This study controlled the length of an output sequence by reducing the score of the end-of-sentence token to −∞ until the method generated the desired number of words. Subsequently, Kikuchi et al. (2016) and Fan et al. (2018) proposed mechanisms for length control; however, these studies produced summaries of 30, 50, and 75 bytes, and the researchers evaluated them by using reference summaries of a single length (approximately 75 bytes long) in DUC 2004 1.
Thus, some questions can be posed: (1) Can longer references adequately evaluate, to some extent, system outputs that are shorter than the reference? (2) How do the words not included in shorter references but included in longer references affect the evaluation? (3) What types of tasks does each length limit involve? and (4) How do the existing length control methods manage those tasks? In this study, we present novel corpora to investigate these research questions. The contributions of this study are threefold.
1. We release the Japanese News Corpus (JNC) 2 , which includes 1.83 million pairs of headlines and the lead three sentences of Japanese news articles. We expect this corpus to provide common supervision data for headline generation.
2. We build the JApanese MUlti-Length Headline Corpus (JAMUL) 2 for the evaluation of headlines of different lengths. In this novel dataset, each news article is associated with multiple headlines of three different lengths.
3. We report new findings on the JAMUL; for example, although the longer reference seems to be able to evaluate the short system output, we also found a problem with this evaluation setting. Additionally, we clarified what type of tasks the existing methods solve according to the length.

1 https://duc.nist.gov/duc2004/
2 https://cl.asahi.com/api_data/jnc-jamul-en.html

Headlines composed by a media company
Before describing the JNC and JAMUL in detail, we explain the process by which a media company composes headlines for a news article. First, reporters write an article and submit it to the editorial department to be published in the newspaper. The editorial department writes a headline for the article dedicated to print media. We call these headlines print headlines or length-insensitive headlines hereafter.
In addition to print headlines, digital media editors, who are typically not the same editors as for print, select from the articles submitted for print those that they want to distribute on digital media and compose three different headlines. The first headline, for digital signage and audio media, has a limit of up to 10 characters. This type of headline is appended to the beginning of a concise summary of the article so that readers can understand the news at first glance. The second type of headline is produced for portable telephones with small LCDs and for small areas on the news site (e.g., the access ranking); the upper limit of the number of characters is 13. The third type of headline is produced for PC news websites, and the upper limit of the number of characters is 26. This limit is derived from the layout of the news site. We refer to the three types of headlines as 10char-ref, 13char-ref, and 26char-ref (refer to Table 1 for an example). We collectively call these headlines length-sensitive headlines. Table 1 presents an example of headlines written for an article by the professional editors. We extract the JNC and JAMUL from the process of news production of trusted and professional sources maintained in databases with time series; therefore, they can be considered representative of contemporary editorial practice.

JNC
The JNC is a collection of 1,829,231 pairs of the three lead sentences of articles and their print headlines published from 2007 to 2016. Figure 1 (a) depicts the distribution of lengths of the headlines in the JNC. Lengths of headlines in the JNC are diverse because of various factors related to publishing newspapers (e.g., space limitation, importance of the news). Important articles tend to be assigned longer print headlines.
The JNC is useful for training headline generation models because it has many training instances. Furthermore, the corpus is suitable for training a model for variable-length headline generation because of the variety of the headline lengths.

JAMUL
The JAMUL is a corpus containing 1,524 news articles and their length-sensitive headlines of 10 characters, 13 characters, and 26 characters for digital media. All the articles and headlines were published between September 2017 and March 2018. The volume of the news articles may be insufficient for training a headline generation model. However, as Figure 1 (b) shows, the JAMUL includes length-sensitive headlines that strictly preserve the length requirements. This characteristic makes the JAMUL suitable as a test set for headline generation. There is no overlap of articles between the JNC and the JAMUL.

Table 3: Word-level precision and recall when comparing length-insensitive/sensitive headlines.

Comparing headlines with article bodies
What type of operation did the editors perform to create length-sensitive and length-insensitive headlines in the JAMUL? To answer this question, we analyzed the proportions of extractive and abstractive operations. Specifically, we reported the word-level precision and recall scores in Table 2, assuming that articles are "system" summaries and that 10char-ref, 13char-ref, and 26char-ref headlines are "reference" summaries. Notably, we removed blank spaces, which were the most common token in longer headlines. The relatively high recall scores indicate that the operations most often required to generate headlines are extractive, and that abstractive operations account for 10% of the total.
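The word-level comparison described above can be illustrated with a minimal sketch. It assumes that both texts are already tokenized and that repeated words are matched with clipped counts; the paper does not state how duplicates are handled, so this choice, the function name, and the toy English tokens are assumptions made only for illustration.

```python
from collections import Counter


def word_precision_recall(system_tokens, reference_tokens):
    """Word-level precision and recall using clipped (multiset) overlap.

    Assumption: a repeated word is matched at most as many times as it
    occurs on the other side.
    """
    system_counts = Counter(system_tokens)
    reference_counts = Counter(reference_tokens)
    overlap = sum((system_counts & reference_counts).values())
    precision = overlap / len(system_tokens) if system_tokens else 0.0
    recall = overlap / len(reference_tokens) if reference_tokens else 0.0
    return precision, recall


# Toy example (hypothetical tokens): the article plays the "system" role
# and the headline plays the "reference" role.
article = ["toyota", "will", "set", "engine", "only",
           "models", "to", "zero", "by", "2025"]
headline = ["toyota", "engine", "only", "models", "zero", "electrified"]
precision, recall = word_precision_recall(article, headline)
```

Under this reading, recall is the share of headline words that can be copied from the article (extractive), and the remainder corresponds to words the editor had to introduce (abstractive).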

Comparing length-sensitive headlines with print headlines
How similar are the headlines used for training (length-insensitive) and for evaluation (length-sensitive)? We estimated the appropriateness of length-insensitive headlines as a "seed" for producing length-sensitive headlines. More concretely, we reported word-level precision and recall scores in Table 3, assuming that length-insensitive headlines are "system" summaries and that 10char-ref, 13char-ref, and 26char-ref headlines are "reference" summaries. The relatively high recall scores indicate that the training and evaluation data are not so distant. Additionally, we found that the editors use a moderate number of words that do not appear in print headlines when composing length-sensitive headlines. Table 4 is an example of the typical differences between the length-insensitive and length-sensitive headlines.

Article (excerpt): On 1st the U.S. Facebook (FB) announced financial results from July to September in 2017 and achieved record quarterly sales and net income thanks to its mobile advertising growth and other factors. ... FB revealed that it will double the number of personnel engaged in preventing fake news from spreading to about 20,000 in order to secure safety on FB.

Headline for print media:
フェイスブック、 四半期で最高益 モバイル広告 好調
Facebook achieved the record quarterly profit thanks to its mobile advertising business.

Multi-length headlines for digital media:
7 chars (10char-ref): 米FBが最高益
The U.S. FB achieved the record profit.
(13char-ref): 米フェイスブックが最高益
The U.S. Facebook achieved the record profit.
(26char-ref): 米フェイスブックが最高益 偽ニュース対策で要員倍増も
The U.S. Facebook achieved the record profit and doubles the personnel for countermeasures against the fake news.

Comparing the 26-character headline with the print headline, we see that the choice of content differs; for example, while the print headline reports the reason for the record profit, the 26-character headline describes the increase in the number of personnel. Next, comparing the 7-character (10char-ref) headline with the print headline, we observe that the choice of words differs; the print headline uses "Facebook", which is changed to "FB" in the 7-character headline.

Comparing length-sensitive headlines
How similar is the composition of headlines for a news article of different lengths? How good are 26char-ref headlines as "seeds" for generating 10char-ref or 13char-ref headlines? Is the simple strategy of trimming 26char-ref headlines to 10 or 13 characters sufficient? To answer these questions, we computed word-level precision and recall scores, assuming that 26char-ref headlines are "system" summaries and that 10char-ref and 13char-ref headlines are "reference" summaries.
The first and second rows of Table 5 represent the situation when we used 26char-ref headlines as they are, without preserving the length constraint. Although this setting is unrealistic, it lets us estimate the upper bound when composing a shorter headline from a 26char-ref headline. The high recall scores indicate that 26char-ref headlines mostly cover the words included in 10char-ref and 13char-ref headlines. The third and fourth rows of Table 5 correspond to the strategy where we generated headlines in 10 and 13 characters from the first 10 and 13 characters of 26char-ref headlines. This strategy achieved moderate success for generating headlines in 13 characters but did not work well for headlines in 10 characters. In other words, we observed large differences between 10char-ref and 26char-ref headlines. The fifth and sixth rows of Table 5 correspond to the strategy where we generated headlines in 10 and 13 characters from the last 10 and 13 characters of 26char-ref headlines. Extracting the latter part of a 26char-ref headline was probably not a good idea because the precision and recall scores were much worse than those for the first 10 and 13 characters. On the other hand, these results also indicate that some of the words included in 10char-ref and 13char-ref headlines appear in the latter part of 26char-ref headlines.
In sum, we found similarities in headlines of different lengths in the JAMUL. However, the simple strategy of trimming a longer headline into a shorter headline is insufficient (except for shrinking 26char-ref headlines into 13char-ref headlines). Table 1 is an example of the typical differences among length-sensitive headlines. There is little overlap between the longer and shorter headlines because the 9- and 13-character headlines use shorter phrases that have nearly the same meaning as those in the 24-character headline. Focusing on "車種" (models), this word is in the latter half of the 24-character headline, which confirms that important keywords are not always included at the beginning of headlines.
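The two trimming strategies discussed above (keeping the first or the last characters of a 26char-ref headline) can be sketched as follows. This is only an illustration under stated assumptions: lengths are counted in Unicode code points, whitespace left at the cut boundary is stripped, and the function names are hypothetical. The example headline is the 26char-ref from Table 4.

```python
def trim_head(headline: str, limit: int) -> str:
    """Keep the first `limit` characters (assumed: code points)."""
    return headline[:limit].strip()


def trim_tail(headline: str, limit: int) -> str:
    """Keep the last `limit` characters (assumed: code points)."""
    if limit <= 0:
        return ""
    return headline[-limit:].strip()


ref26 = "米フェイスブックが最高益 偽ニュース対策で要員倍増も"
head13 = trim_head(ref26, 13)  # first segment of the headline
tail13 = trim_tail(ref26, 13)  # second segment of the headline
head10 = trim_head(ref26, 10)  # cut in the middle of a phrase
```

In this example the 13-character head happens to coincide with a complete phrase, whereas the 10-character head is cut mid-phrase, which is in line with the observation that trimming works better for 13 characters than for 10.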

Comparing headline generation methods on JAMUL
In this section, we explore a question about evaluation: how reliable is the conventional evaluation method using a single length summary for measuring the quality of summaries of different lengths?
In order to answer this question, we generate multiple summaries of different lengths by using the existing methods, and measure the correlation between the performance values computed by the conventional evaluation method and those computed on JAMUL.

Headline generation methods with the mechanism to control output length
In this study, we explored four methods for headline generation that can control the output length. The first two methods, LenEmb and LenInit, were proposed by Kikuchi et al. (2016). LenEmb provides the decoder with output length information in the form of a length embedding. LenInit controls the output length by multiplying the initial state of the decoder's memory cell by the desired length. Fan et al. (2018) also proposed a length-controllable method for a convolutional sequence-to-sequence (ConvS2S) model (Gehring et al., 2017). Their method added special tokens indicating the range of the output length at the beginning of an input sequence. In our experiment, we used a special token to specify an output length 3 and called this method SP-token.
We also considered the LC method, which extends ConvS2S and multiplies the initial state of the residual connection (He et al., 2016) by the desired number of output tokens. In the experiment, we set the desired number of characters instead of that of tokens.
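To make the two input-side ideas above more concrete, the following sketch shows what SP-token and LenInit could look like at the data level. It is not the implementation used in the experiments: the token format `<len_N>`, the function names, and the use of a plain Python list in place of a learned parameter vector are all assumptions for illustration.

```python
def add_length_token(source_tokens, target_length):
    """SP-token style: prepend a special token naming the desired length.

    The token format "<len_N>" is hypothetical.
    """
    return [f"<len_{target_length}>"] + list(source_tokens)


def len_init_state(length_vector, target_length):
    """LenInit style: scale a vector by the desired length to obtain the
    initial decoder memory cell.

    `length_vector` stands in for a learned parameter; here it is just a
    list of floats.
    """
    return [weight * target_length for weight in length_vector]


tokens = add_length_token(["トヨタ", "は", "18", "日"], 13)
initial_cell = len_init_state([0.5, -0.25, 1.0], 13)
```

The design difference is where the length signal enters the model: SP-token changes only the input sequence and therefore works with any encoder-decoder architecture, whereas LenInit and LC modify an internal state of the decoder.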

Datasets and evaluation protocol
We trained the six models for headline generation on the JNC. We removed instances that were duplicated or unsuitable for training a headline generation model 4 . The filtering step obtained 1,568,360 pairs of newspaper articles and headlines. We randomly selected 1% of the instances (15,546 pairs) as a validation set and used the remainder (1,523,469 pairs) as a training set. We used Byte Pair Encoding (BPE) 5 (Sennrich et al., 2016) for tokenization. We set the number of merge operations to 8,000 and pretokenized all the data with MeCab. Finally, we obtained 11,257 tokens for both sides. When training a model, we gave the model the length of each reference headline as the desired length. When generating headlines in the evaluation, we set the output lengths to 10, 13, and 26 characters; each output was evaluated against the reference that had the same length in the JAMUL. We evaluated all models by using three variants of the ROUGE (Lin, 2004) recall metric 6 : ROUGE-1, ROUGE-2, and ROUGE-L. Headlines exceeding the length limits were trimmed for the fairness of the evaluation.
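As an illustration of the evaluation protocol above, the following sketch trims an over-length output and then computes recall-oriented ROUGE-1, ROUGE-2, and ROUGE-L. It is a simplified stand-in for the ROUGE toolkit, not the scorer used in the experiments; treating each character as a token and the example system output are assumptions made for the toy example.

```python
from collections import Counter


def rouge_n_recall(candidate, reference, n=1):
    """Clipped n-gram overlap divided by the number of reference n-grams."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n])
                       for i in range(len(tokens) - n + 1))
    reference_ngrams = ngrams(reference)
    total = sum(reference_ngrams.values())
    if total == 0:
        return 0.0
    return sum((ngrams(candidate) & reference_ngrams).values()) / total


def rouge_l_recall(candidate, reference):
    """Longest common subsequence length divided by reference length."""
    if not reference:
        return 0.0
    previous = [0] * (len(reference) + 1)
    for cand_token in candidate:
        current = [0]
        for j, ref_token in enumerate(reference, start=1):
            if cand_token == ref_token:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1] / len(reference)


# Hypothetical system output exceeding the 10-character limit.
output = "米FBが過去最高益を記録"
trimmed = output[:10]
reference = "米FBが最高益"  # 10char-ref from Table 4
r1 = rouge_n_recall(list(trimmed), list(reference), n=1)
r2 = rouge_n_recall(list(trimmed), list(reference), n=2)
rl = rouge_l_recall(list(trimmed), list(reference))
```

Because only recall is measured, a longer output can only gain matches; trimming before scoring removes that advantage for systems that exceed the limit.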

Implementation
We employed OpenNMT 7 (Klein et al., 2017) for Seq2Seq and fairseq 8 for ConvS2S and Transformer. We extended the implementations to realize LenEmb, LenInit, and LC. We set the dimensions for token and length embeddings to 512, those for hidden states to 512, and the beam width to 5. These parameters are common to all the models; Table 6 summarizes other parameters specific to each sequence-to-sequence model. We used Nesterov's accelerated gradient method (NAG) (Sutskever et al., 2013) with a momentum of 0.99 in ConvS2S. In Transformer, we set the number of attention heads to 8, the dimensions for the feed-forward network to 2048, Adam's β to 0.98, warm-up steps to 4000, and label smoothing to 0.1.

Evaluating multi-length headlines generated by methods on the JAMUL
Table 7 presents ROUGE scores of each method on the JAMUL. Transformer + SP-token was the clear winner in all lengths and evaluation metrics on this dataset. Additionally, the three methods with SP-token outperformed the others except for R-2 and R-L on 26 characters (Seq2Seq + LenInit was better than ConvS2S + SP-token).
What if we do not have multiple headlines of different lengths to evaluate the methods? To answer this question, we followed the evaluation setup of the previous studies on DUC 2004: the reference summaries of 75 bytes were used even when evaluating summaries of 30 and 50 bytes. Table 8 reports ROUGE scores for the system outputs in 10 and 13 characters evaluated based on the 26char-ref headlines. This evaluation setup reduced the performance differences between the methods. Although Transformer + SP-token remained the clear winner, the performance of Seq2Seq + SP-token in 13 characters was now lower than that of Seq2Seq + LenInit.
Thus, we computed rank correlation coefficients (Kendall's τ) to assess the discrepancy in the ranking among the methods presented by Tables 7 and 8. Additionally, we computed Kendall's τ when using the first 10 or 13 characters in 26char-ref headlines as a reference (26char-trim). Table 9 reveals that the rank correlation is not perfect (lower than one) but moderate: there is a possibility that the order of the scores of two methods may flip depending on the evaluation setup. This result is consistent with Shapira et al. (2018), who examined the validity of evaluating with a single-length reference in multi-document summarization.
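The rank comparison above can be sketched with a small pure-Python computation of Kendall's τ. The text does not state which variant of τ was used, so the tau-a form (no tie correction) is an assumption here, and the six score values are invented for illustration; they are not taken from Table 7 or Table 8.

```python
from itertools import combinations


def kendall_tau_a(scores_a, scores_b):
    """Kendall's tau-a between two score lists for the same systems."""
    concordant = 0
    discordant = 0
    for i, j in combinations(range(len(scores_a)), 2):
        product = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    pairs = len(scores_a) * (len(scores_a) - 1) // 2
    return (concordant - discordant) / pairs


# Hypothetical ROUGE scores of six systems under two evaluation setups.
same_length_reference = [40.1, 41.3, 42.0, 39.5, 38.7, 45.2]
longest_reference = [30.2, 31.9, 31.5, 29.8, 29.0, 34.4]
tau = kendall_tau_a(same_length_reference, longest_reference)
```

With six systems there are 15 pairs, so a single swapped pair (here the second and third systems) already lowers τ from 1 to 13/15; this shows how a moderate coefficient corresponds to a few flipped rankings.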

Performance of word selection according to output length
How well do the existing methods change the word selection depending on the output length? As shown in the first and second rows of Table 5, 10char-ref and 13char-ref headlines contain words that are not included in 26char-ref headlines. In other words, the selection of words in the generated headline should change in response to the length restriction. To examine this point, we computed word-level precision and recall scores for the system outputs generated by each method, using as "reference" summaries the words that are included in 10char-ref or 13char-ref headlines but not in 26char-ref headlines. For instance, the red words in Table 1 are the "reference" summaries in this experiment. We report this result in Table 10. The low recall scores indicate that no system can select the words tailored to the length constraints. The difference in precision scores between the models is small. We infer that there is almost no difference between the existing methods in terms of word selection specialized for the length constraint.
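The construction of the length-specific "reference" described above can be sketched as a set difference followed by the usual word-level scores. The tokenization, the example headlines, and the system output below are hypothetical (only '全' and '電動車' are mentioned in the caption of Table 1), and the clipped-count matching is an assumption.

```python
from collections import Counter


def length_specific_words(short_ref_tokens, long_ref_tokens):
    """Words of the shorter reference that are absent from the longer one."""
    long_vocabulary = set(long_ref_tokens)
    return [token for token in short_ref_tokens
            if token not in long_vocabulary]


def precision_recall(system_tokens, reference_tokens):
    """Word-level precision and recall with clipped counts (assumption)."""
    overlap = sum((Counter(system_tokens) & Counter(reference_tokens)).values())
    precision = overlap / len(system_tokens) if system_tokens else 0.0
    recall = overlap / len(reference_tokens) if reference_tokens else 0.0
    return precision, recall


# Hypothetical tokenized headlines.
long_ref = ["トヨタ", "エンジン車", "だけ", "の", "車種", "ゼロ", "へ"]
short_ref = ["全", "車種", "を", "電動車", "に"]
target_words = length_specific_words(short_ref, long_ref)

system_output = ["トヨタ", "全", "車種", "電動化"]
precision, recall = precision_recall(system_output, target_words)
```

A system that simply shortens the longer headline would score close to zero on this reference, which is what makes the measure specific to length-dependent word choice.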

Performance of managing extractive and abstractive tasks
In Table 2, we reported the proportions of extractive and abstractive operations in the JAMUL. We now analyze how well the existing methods reflect extractive and abstractive operations when generating summaries. First, to observe extractive operations, we computed word-level precision and recall scores for the outputs of each system, using as "reference" summaries the words shared between an article and its 10char-ref, 13char-ref, and 26char-ref headlines. Table 11 reports the result. The relatively high recall scores indicate that the length control methods succeed in managing extractive operations.
Next, we examine whether the length control methods could perform abstractive operations. We adopted the words included in 10char-ref, 13char-ref, or 26char-ref headlines but not included in an article as "reference" summaries, and computed the precision and recall scores for the system outputs (Table 12). Regarding the outputs targeting 26 characters, the recall scores of around 40% imply that each model can manage abstractive operations to some extent. In contrast, the low recall scores for the outputs targeting 10 and 13 characters reveal that none of the length control methods performs well on abstractive operations under the severe length constraint.

Table 7: ROUGE scores of each model on the JAMUL. Specified lengths are 10, 13 and 26 characters. R-1, R-2, and R-L represent ROUGE-1, ROUGE-2, and ROUGE-L, respectively. Note that (1) to (6) in Table 8 and the subsequent tables denote models (1) to (6).

Table 9: Kendall's rank correlation coefficients (τ) between Table 7 and Table 8. The part left of the slash represents the specified output length, and the part right of the slash represents the reference headlines.

How do length control mechanisms work?
We suspected that a method able to control the output length might produce similar headlines for different lengths for the same news article. To confirm this suspicion, we reported ROUGE-1 recall scores in Figure 2 with three different configurations: (a) evaluating the first 13 characters of headlines generated to be 26 characters long on 13char-ref headlines (yellow); (b) evaluating headlines generated to be 13 characters long on 13char-ref headlines (green); and (c) the same as (a) but evaluated on the headlines generated to be 13 characters long (blue). Setting (a) corresponds to the strategy where we trimmed headlines of different lengths to 13 characters long. This setting was worse than setting (b), where a method tailored headlines to the desired length. However, the difference in ROUGE scores between (a) and (b) was not so large, indicating that the existing methods do not drastically change the content between outputs of 13 characters and 26 characters. This tendency was also verified by setting (c), which assessed how much the first 13 characters of headlines generated to be 26 characters long covered the content of those generated to be 13 characters long. These facts suggest that we should explore a method not only trained by generic supervision data (print headlines) but also tuned for the desired length in further research.

Table 11: Word-level precision and recall scores comparing the system outputs and the overlap words between the words in an article and the words in 10char-ref, 13char-ref, and 26char-ref.

Related work
Rush et al. (2015) created the first approach to neural abstractive summarization. They generated a headline from the first sentence of a news article in the Annotated English Gigaword corpus (Napoles et al., 2012), which contains an enormous number of pairs of headlines and articles. After their study, a number of researchers addressed this task: for example, Chopra et al. (2016) used the encoder-decoder framework (Sutskever et al., 2014; Bahdanau et al., 2015) and Nallapati et al. (2016) incorporated additional features into the model, such as part-of-speech tags and named entities. Suzuki and Nagata (2017) proposed word-frequency estimation to reduce the generation of repeated phrases. Zhou et al. (2017) proposed a gating mechanism (sGate) to ensure that important information is selected at each decoding step. Unfortunately, attempts to control the output length in neural abstractive summarization have been limited. Shi et al. (2016) reported that hidden states in recurrent neural networks in the encoder-decoder framework could implicitly model the length of output sequences. Kikuchi et al. (2016) were the first to propose the idea of controlling the output length in the encoder-decoder framework. Their approach inserts embeddings for the output length into the decoder. Additionally, Fan et al. (2018) reported that output lengths could be controlled by embeddings of special tokens given to an input sequence. These two studies used DUC 2004 (Over et al., 2007), which comprises only 75-byte summaries, to evaluate the outputs in multiple lengths. A method to control the number of output words in the ConvS2S model has also been proposed. However, no previous work built a dataset for evaluating headlines of multiple lengths or reported an in-depth perspective on this task along the process of news production in the real world. On the other hand, Shapira et al. (2018) reported that a single-length reference could appropriately evaluate summaries of multiple lengths in multi-document summarization. In that study, they confirmed the correlation of ROUGE-1 scores between evaluation with a single-length reference and evaluation with multiple (gold) length references. Our research differs in that we examine why a strong correlation occurs and study the headline generation domain, which requires stricter keyword selection.

Conclusion
In this paper, we presented two new corpora: the JNC contains a large number of pairs of news articles and their headlines, and the JAMUL includes headlines of three different lengths (10, 13, and 26 characters long) written by professional editors. This study is the first to analyze the characteristics of multiple headlines of different lengths and to evaluate existing approaches for length control based on the reference headlines composed for different lengths. We found that the Transformer model with a special length token (SP-token) outperformed the other methods on the JAMUL. Additionally, while we confirmed that single-length (the longest) references could adequately evaluate system outputs of multiple lengths, the existing methods cannot adapt their word selection to the length constraint. We also found it difficult to evaluate methods for controlling output length because headlines of different lengths are written based on different goals and because the training data does not necessarily reflect the goal of the headlines of a specific length. In the future, we plan to explore an approach to adapt a model trained on print headlines to headlines dedicated to a specific length.