Event-Driven Headline Generation

We propose an event-driven model for headline generation. Given an input document, the system identifies a key event chain by extracting a set of structural events that describe it. Then a novel multi-sentence compression algorithm is used to fuse the extracted events, generating a headline for the document. Our model can be viewed as a novel combination of extractive and abstractive headline generation, combining the advantages of both methods using event structures. Standard evaluation shows that our model achieves the best performance compared with previous state-of-the-art systems.


Introduction
Headline generation (HG) is a text summarization task, which aims to describe an article (or a set of related paragraphs) using a single short sentence. The task is useful in a number of practical scenarios, such as compressing text for mobile device users (Corston-Oliver, 2001), generating tables of contents (Erbs et al., 2013), and email summarization (Wan and McKeown, 2004). This task is challenging not only in informativeness and readability, which are challenges common to summarization tasks, but also in length reduction, which is unique to headline generation.
Figure 1: System framework (panel labels: Headline Candidate Extraction, Multi-Sentence Compression, Headline Generation).
Both extractive and abstractive HG models consist of two steps: candidate extraction and headline generation. Extractive models choose a set of salient sentences in candidate extraction, and then exploit sentence compression techniques to achieve headline generation (Dorr et al., 2003; Zajic et al., 2005). Abstractive models choose a set of informative phrases for candidate extraction, and then exploit sentence synthesis techniques for headline generation (Soricut and Marcu, 2007; Woodsend et al., 2010; Xu et al., 2010). Extractive HG and abstractive HG have their respective advantages and disadvantages. Extractive models can generate more readable headlines, because the final title is derived by tailoring human-written sentences.
However, extractive models give less informative titles (Alfonseca et al., 2013), because sentences are very sparse, making high-recall candidate extraction difficult. In contrast, abstractive models use phrases as the basic processing units, which are much less sparse. However, it is more difficult for abstractive HG to ensure the grammaticality of the generated titles, given that sentence synthesis is still very inaccurate based on a set of phrases with little grammatical information (Zhang, 2013).
In this paper, we propose an event-driven model for headline generation, which alleviates the disadvantages of both extractive and abstractive HG. The framework of the proposed model is shown in Figure 1.
In particular, we use events as the basic processing units for candidate extraction. We use structured tuples to represent the subject, predicate and object of an event. This form of event representation is widely used in open information extraction (Fader et al., 2011;Qiu and Zhang, 2014). Intuitively, events can be regarded as a trade-off between sentences and phrases. Events are meaningful structures, containing necessary grammatical information, and yet are much less sparse than sentences. We use salience measures of both sentences and phrases for event extraction, and thus our model can be regarded as a combination of extractive and abstractive HG.
During the headline generation step, a graph-based multi-sentence compression (MSC) model is proposed to generate a final title, given multiple events. First, a directed acyclic word graph is constructed based on the extracted events, and then a beam-search algorithm is used to find the best title based on path scoring.
We conduct experiments on standard datasets for headline generation.
The results show that headline generation can benefit not only from exploiting events as the basic processing units, but also from the proposed graph-based MSC model. Both our candidate extraction and headline generation methods outperform competitive baseline methods, and our model achieves the best results compared with previous state-of-the-art systems.

Background
Previous extractive and abstractive models take two main steps, namely candidate extraction and headline generation. Here, we introduce these two types of models according to the two steps.

Extractive Headline Generation
Candidate Extraction. Extractive models exploit sentences as the basic processing units in this step. Sentences are ranked by their salience according to specific strategies (Dorr et al., 2003; Erkan and Radev, 2004; Zajic et al., 2005). One of the state-of-the-art approaches is the work of Erkan and Radev (2004), which exploits centroid, position and length features to compute sentence salience. We re-implemented this method as our baseline sentence ranking method. In this paper, we use SentRank to denote this method.
Headline Generation. Given a set of sentences, extractive models exploit sentence compression techniques to generate a final title. Most previous work exploits single-sentence compression (SSC) techniques. Dorr et al. (2003) proposed the Hedge Trimmer algorithm to compress a sentence by making use of handcrafted linguistically-based rules. Alfonseca et al. (2013) introduce a multi-sentence compression (MSC) model into headline generation, using it as a baseline in their work. They indicate that the most important information is distributed across several sentences in the text.

Abstractive Headline Generation
Candidate Extraction. Different from extractive models, abstractive models exploit phrases as the basic processing units. A set of salient phrases is selected according to specific principles during candidate extraction (Schwartz, 01; Soricut and Marcu, 2007; Xu et al., 2010; Woodsend et al., 2010). Xu et al. (2010) propose to rank phrases using background knowledge extracted from Wikipedia. Woodsend et al. (2010) use supervised models to learn the salience score of each phrase. Here, we use the work of Soricut and Marcu (2007), namely PhraseRank, as our baseline phrase ranking method, which is an unsupervised model without external resources. The method exploits unsupervised topic discovery to find a set of salient phrases.
Headline Generation. In the headline generation step, abstractive models exploit sentence synthesis technologies to accomplish headline generation. Zajic et al. (2005) exploit unsupervised topic discovery to find key phrases, and use the Hedge Trimmer algorithm to compress candidate sentences. One or more key phrases are added into the compressed fragment according to the length of the headline. Soricut and Marcu (2007) employ WIDL-expressions to generate headlines. Xu et al. (2010) employ keyword clustering based on several bag-of-words models to construct a headline. Woodsend et al. (2010) use quasi-synchronous grammar (QG) to optimize phrase selection and surface realization preferences jointly.

Similar to extractive and abstractive models, the proposed event-driven model consists of two steps, namely candidate extraction and headline generation.

Candidate Extraction
We exploit events as the basic units for candidate extraction. Here an event is a tuple (S, P, O), where S is the subject, P is the predicate and O is the object. For example, for the sentence "Ukraine Delays Announcement of New Government", the event is (Ukraine, Delays, Announcement). This type of event structure has been used in open information extraction (Fader et al., 2011), and has a range of NLP applications (Ding et al., 2014; Ng et al., 2014).
A sentence is a well-formed structure with complete syntactic information, but can contain redundant information for text summarization, which makes sentences very sparse. Phrases can be used to avoid the sparsity problem, but with little syntactic information between phrases, fluent headline generation is difficult. Events can be regarded as a trade-off between sentences and phrases. They are meaningful structures without redundant components, less sparse than sentences and containing more syntactic information than phrases.
In our system, candidate event extraction is performed on a bipartite graph, where the two types of nodes are lexical chains (Section 3.1.2) and events (Section 3.1.1), respectively. The Mutual Reinforcement Principle (MRP) (Zha, 2002) is applied to jointly learn chain and event salience on the bipartite graph for a given input. We obtain the top-k candidate events by their salience measures.

Extracting Events
We apply an open-domain event extraction approach. Different from traditional event extraction, for which types and arguments are predefined, open event extraction does not have a closed set of entities and relations (Fader et al., 2011). We follow Hu et al. (2013) to extract events.
Given a text, we first use the Stanford dependency parser to obtain the Stanford typed dependency structures of the sentences (Marneffe and Manning, 2008). Then we focus on two relations, nsubj and dobj, for extracting event arguments. Event arguments that have the same predicate are merged into one event, represented by a tuple (Subject, Predicate, Object). For example, given the sentence "the Keenans could demand the Aryan Nations' assets", Figure 2 presents its partial parse tree. Based on the parsing results, two event arguments are obtained: nsubj(demand, Keenans) and dobj(demand, assets). The two event arguments are merged into one event: (Keenans, demand, assets).
Figure 2: Dependency tree for the sentence "the Keenans could demand the Aryan Nations' assets".
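The merging of nsubj and dobj arguments that share a predicate can be illustrated with a minimal Python sketch. It assumes the parser output has already been reduced to (relation, head, dependent) triples; this triple format, the function name `merge_event_arguments`, and the choice to keep only predicates that have both a subject and an object are assumptions made for illustration, not details taken from the extraction method followed here.

```python
from collections import defaultdict


def merge_event_arguments(dependencies):
    """Merge nsubj/dobj arguments sharing a predicate into (S, P, O) tuples.

    `dependencies` is a list of (relation, head, dependent) triples, a
    hypothetical simplification of a dependency parser's output.
    """
    subjects = defaultdict(list)
    objects = defaultdict(list)
    for relation, head, dependent in dependencies:
        if relation == "nsubj":
            subjects[head].append(dependent)
        elif relation == "dobj":
            objects[head].append(dependent)

    events = []
    # Assumption: a predicate yields an event only if it has both a subject
    # and an object; other relations are ignored.
    for predicate, subject_list in subjects.items():
        for subject in subject_list:
            for obj in objects.get(predicate, []):
                events.append((subject, predicate, obj))
    return events


# Illustrative triples for "the Keenans could demand the Aryan Nations' assets".
deps = [
    ("det", "Keenans", "the"),
    ("nsubj", "demand", "Keenans"),
    ("aux", "demand", "could"),
    ("dobj", "demand", "assets"),
    ("poss", "assets", "Nations"),
]
events = merge_event_arguments(deps)
print(events)  # [('Keenans', 'demand', 'assets')]
```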

Extracting Lexical Chains
Lexical chains are used to link semantically related words and phrases (Morris and Hirst, 1991; Barzilay and Elhadad, 1997). A lexical chain is analogous to a semantic synset. Compared with words, lexical chains are less sparse for event ranking.
Given a text, we follow Boudin and Morin (2013) to construct lexical chains based on the following principles:
1. All words that are identical after stemming are treated as one word;
2. All NPs with the same head word fall into one lexical chain;
3. A pronoun is added to the corresponding lexical chain if it refers to a word in the chain (coreference resolution is performed using the Stanford Coreference Resolution system);
4. Lexical chains are merged if their main words are in the same synset of WordNet.
At initialization, each word in the document is a lexical chain. We repeatedly merge existing chains by the four principles above until convergence.
In particular, we focus on content words only, including verbs, nouns and adjectives. After the merging, each lexical chain represents a word cluster, and the first occurring word in it can be used as the main word of the chain.
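Repeated merging of chains until convergence can be sketched with a union-find structure. The sketch below is only illustrative: it assumes the merge evidence (shared stem, shared head word, coreference, shared synset) has already been computed elsewhere and is passed in as word pairs, and the names `build_chains`, `words` and `links` are hypothetical.

```python
def build_chains(words, links):
    """Cluster words into chains by merging every linked pair (union-find).

    `words` lists the distinct content words in document order; `links` holds
    pairs that some merging principle says belong together. Both inputs are
    assumed to be produced by earlier processing steps.
    """
    parent = {word: word for word in words}

    def find(word):
        # Follow parent pointers to the chain representative (path halving).
        while parent[word] != word:
            parent[word] = parent[parent[word]]
            word = parent[word]
        return word

    for first, second in links:
        root_first, root_second = find(first), find(second)
        if root_first != root_second:
            parent[root_second] = root_first

    chains = {}
    # Iterating in document order keeps the first occurring word at the
    # front of each chain, where it can serve as the main word.
    for word in words:
        chains.setdefault(find(word), []).append(word)
    return list(chains.values())


words = ["government", "governments", "it", "cabinet", "delay"]
links = [
    ("government", "governments"),  # illustrative: same stem
    ("government", "it"),           # illustrative: coreference
    ("government", "cabinet"),      # illustrative: assumed to share a synset
]
chains = build_chains(words, links)
print(chains)  # [['government', 'governments', 'it', 'cabinet'], ['delay']]
```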

Learning Salient Events
Intuitively, one word should be more important if it occurs in more important events. Similarly, one event should be more important if it includes more important words. Inspired by this, we construct a bipartite graph between lexical chains and events, shown in Figure 3, and then exploit MRP to jointly learn the salience of lexical chains and events. MRP has been shown to be effective for jointly learning the vertex weights of a bipartite graph (Zhang et al., 2008; Ventura et al., 2013).
Given a text, we construct a bipartite graph between the lexical chains and events, with an edge between a lexical chain and an event if the event contains a word in the lexical chain. Suppose that there are $n$ events $\{e_1, \dots, e_n\}$ and $m$ lexical chains $\{l_1, \dots, l_m\}$ in the bipartite graph $G_{bi}$. Their scores are represented by $sal(e) = \{sal(e_1), \dots, sal(e_n)\}$ and $sal(l) = \{sal(l_1), \dots, sal(l_m)\}$, respectively. We compute the final $sal(e)$ and $sal(l)$ iteratively by MRP. At each step, $sal(e_i)$ and $sal(l_j)$ are recomputed from each other using the weights $r_{ij}$, where $r_{ij} \in \mathbb{R}$ denotes the cohesion between event $e_i$ and lexical chain $l_j$, $A$ is a normalization factor, $sal(\cdot)$ denotes the salience, and the initial values of $sal(e)$ and $sal(l)$ can be assigned randomly.
The remaining problem is how to define the salience score of a given lexical chain $l_j$ and a given event $e_i$. In this work, we use the guidance of abstractive and extractive models to compute $sal(l_j)$ and $sal(e_i)$, respectively. Here $sal_{abs}(\cdot)$ denotes the word salience score of an abstractive model, $sal_{ext}(\cdot)$ denotes the sentence salience score of an extractive model, and $Sen(e_i)$ denotes the set of sentences from which $e_i$ is extracted. We exploit our baseline sentence ranking method, SentRank, to obtain the sentence salience score, and use our baseline phrase ranking method, PhraseRank, to obtain the phrase salience score.
Figure 3: Bipartite graph where two vertex sets denote lexical chains and events, respectively.
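The iterative computation can be illustrated with a generic mutual reinforcement loop on a bipartite graph. This is a sketch under stated assumptions rather than the exact update used in this work: the cohesion matrix `r` is taken as given, the normalization factor is assumed to be the Euclidean norm of each score vector, and scores start from uniform rather than random values so that the output is deterministic. The function names are hypothetical.

```python
import math


def normalize(scores):
    """Scale a score vector to unit Euclidean length (assumed normalization)."""
    norm = math.sqrt(sum(score * score for score in scores))
    return [score / norm for score in scores] if norm > 0 else list(scores)


def mutual_reinforcement(r, iterations=10):
    """Jointly estimate event and lexical-chain salience.

    r[i][j] is the cohesion between event i and lexical chain j, and is zero
    when the event contains no word of the chain.
    """
    n, m = len(r), len(r[0])
    sal_e = [1.0] * n  # uniform start instead of random initial values
    sal_l = [1.0] * m
    for _ in range(iterations):
        # An event is salient if it is linked to salient chains ...
        new_e = [sum(r[i][j] * sal_l[j] for j in range(m)) for i in range(n)]
        # ... and a chain is salient if it is linked to salient events.
        new_l = [sum(r[i][j] * sal_e[i] for i in range(n)) for j in range(m)]
        sal_e, sal_l = normalize(new_e), normalize(new_l)
    return sal_e, sal_l


# Event 0 touches both chains; event 1 touches only chain 1.
r = [[1.0, 1.0],
     [0.0, 1.0]]
sal_e, sal_l = mutual_reinforcement(r, iterations=10)
print(sal_e[0] > sal_e[1], sal_l[1] > sal_l[0])  # True True
```

The top-k events by `sal_e` would then be taken as candidates.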

Headline Generation
We use a graph-based multi-sentence compression (MSC) model to generate the final title for the proposed event-driven model. The model is inspired by Filippova (2010). First, a weighted directed acyclic word graph is built, with a start node and an end node in the graph. A headline can be obtained by any path from the start node to the end node. We measure each candidate path by a scoring function. Based on the measurement, we exploit a beam-search algorithm to find the optimum path.

Word-Graph Construction
Given a set of candidate events $CE$, we extract all the sentences that contain the events. In particular, we add two artificial words, S and E, to the start position and end position of all sentences, respectively. Following Filippova (2010), we extract all words in the sentences as graph vertexes, and then construct edges based on these words. Filippova (2010) adds edges for all the word pairs that are adjacent in one sentence. A title generated using this strategy can mistakenly contain common word bigrams (i.e. adjacent words) in different sentences. To address this, we change the strategy slightly, by adding edges for all word pairs of one sentence in the original order. In other words, if word $w_j$ occurs after $w_i$ in one sentence, then we add an edge $w_i \to w_j$ to the graph. Figure 4 gives an example of the word graph. The search space of the graph is larger than that of Filippova (2010) because more edges are added.
Different from Filippova (2010), salience information is introduced into the calculation of the vertex weights. A word that occurs in more salient candidates should have a higher weight. Let $G = (V, E)$ be a graph, where $V = \{V_1, \dots, V_n\}$ denotes the word nodes and $E = \{E_{ij} \in \{0, 1\}, i, j \in [1, n]\}$ denotes the edges. The weight of a vertex $V_i$ (Equation 3) is computed from $sal(e)$, the salience score of an event from the candidate extraction step, and $dist(V_i.w, e)$, where $V_i.w$ denotes the word of vertex $V_i$ and $dist(w, e)$ denotes the distance from the word $w$ to the event $e$, defined as the minimum distance from $w$ to all the related words of $e$ in a sentence along the dependency path between them. Intuitively, Equation 3 states that a vertex is salient when its corresponding word is close to salient events. It is worth noting that the formula can be adapted to extractive and abstractive models as well, by replacing events with sentences and phrases. We use these variants for the SentRank and PhraseRank baseline systems in Section 4.3, respectively.
The equation for the edge weight is adopted from Filippova (2010). In it, $w(E_{ij})$ refers to the sum of $rdist(V_i.w, V_j.w)$ over all sentences, and $rdist(\cdot)$ denotes the reciprocal distance of two words in a sentence along the dependency path. By the formula, an edge is salient when the corresponding vertex weights are large or the corresponding words are close.
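The difference between adjacent-only edges and edges for all ordered word pairs can be illustrated with a minimal Python sketch. It makes several simplifying assumptions: words with the same surface form share one vertex, the artificial start and end words are written as `<S>` and `<E>`, vertex and edge weights are left out, and nothing is done to enforce acyclicity across sentences. The function name `build_word_graph` and its `all_pairs` flag are hypothetical.

```python
def build_word_graph(sentences, all_pairs=True):
    """Return the directed edge set of a word graph.

    all_pairs=False links only adjacent words of a sentence; all_pairs=True
    links every word to every later word of the same sentence.
    """
    edges = set()
    for sentence in sentences:
        words = ["<S>"] + list(sentence) + ["<E>"]
        for i in range(len(words)):
            # Adjacent-only mode stops after the immediate successor.
            stop = len(words) if all_pairs else min(i + 2, len(words))
            for j in range(i + 1, stop):
                if words[i] != words[j]:  # skip self-loops from repeated words
                    edges.add((words[i], words[j]))
    return edges


sentences = [["Ukraine", "delays", "announcement"]]
adjacent = build_word_graph(sentences, all_pairs=False)
ordered = build_word_graph(sentences, all_pairs=True)
print(len(adjacent), len(ordered))  # 4 10
print(("Ukraine", "announcement") in adjacent)  # False
print(("Ukraine", "announcement") in ordered)   # True
```

With all ordered pairs, a path may skip words inside a sentence, which is what allows compressions of arbitrary length at the cost of a larger search space.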

Scoring Method
The key to our MSC model is the path scoring function. We measure a candidate path based on two aspects. Besides the summed edge score of the path, we exploit a trigram language model to compute a fluency score of the path. Language models have been commonly used to generate more readable titles. The overall score of a path combines the edge score and the fluency score, where $p$ is a candidate path and the corresponding word sequence of $p$ is $w_1 \cdots w_n$. The trigram language model is trained using SRILM on English Gigaword (LDC2011T07).
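One way an edge score and a trigram fluency score might be combined is sketched below. The details are assumptions made for illustration: edge weights enter through their logarithms, both terms are divided by the path length, the fluency term is weighted by a parameter `lam`, and `trigram_logprob` is a hypothetical callable standing in for a trained language model.

```python
import math


def path_score(words, edge_weight, trigram_logprob, lam=0.4):
    """Combine an edge score and a fluency score for one candidate path.

    `edge_weight` maps (word, next_word) pairs to positive weights, and
    `trigram_logprob(w1, w2, w3)` returns log p(w3 | w1, w2).
    """
    # Edge term: summed log edge weights along the path (assumed form).
    edge = sum(math.log(edge_weight[pair]) for pair in zip(words, words[1:]))
    # Fluency term: trigram log-probability of the word sequence, using two
    # padding symbols so that every word has a two-word history.
    padded = ["<s>", "<s>"] + list(words)
    fluency = sum(
        trigram_logprob(padded[i - 2], padded[i - 1], padded[i])
        for i in range(2, len(padded))
    )
    # Assumption: normalize by length so that short paths are not favoured.
    return (edge + lam * fluency) / len(words)


def uniform_lm(w1, w2, w3):
    """Toy language model assigning probability 0.1 to every word."""
    return math.log(0.1)


path = ["<S>", "Ukraine", "delays", "<E>"]
weights = {pair: 1.0 for pair in zip(path, path[1:])}
score = path_score(path, weights, uniform_lm, lam=0.4)
print(round(score, 3))  # -0.921
```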

Beam Search
Beam search has been widely used to find a sub-optimal result (Collins and Roark, 2004; Zhang and Clark, 2011) when exact inference is extremely difficult. Assuming our word graph has $n$ vertexes, the worst-case computational complexity is $O(n^4)$ when using a trigram language model, which is time-consuming. Using beam search with beam size $B$, the time complexity decreases to $O(Bn^2)$.
Pseudo-code of our beam search algorithm is shown in Figure 5. During search, we use candidates to save a fixed size (B) of partial results. For each iteration, we generate a set of new candidates by adding one vertex from the graph, computing their scores, and maintaining the top B candidates for the next iteration. If one candidate reaches the end of the graph, we do not expand it, directly adding it into the new candidate set according to its current score. If all the candidates reach the end, the searching algorithm terminates and the result path is the candidate from candidates with the highest score.
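The search procedure described above can be sketched in Python as follows. This is not the pseudo-code of Figure 5 but a minimal illustration under assumptions: the graph is given as a successor dictionary, `score` is any function from a partial path to a number, a vertex is never revisited within one path, and `max_steps` is a safety bound added here. All names are hypothetical.

```python
import heapq


def beam_search(successors, score, start="<S>", end="<E>",
                beam_size=8, max_steps=50):
    """Return the best-scoring path from `start` to `end` found by beam search."""
    candidates = [[start]]
    for _ in range(max_steps):
        if all(path[-1] == end for path in candidates):
            break  # every candidate has reached the end vertex
        expanded = []
        for path in candidates:
            if path[-1] == end:
                expanded.append(path)  # finished paths are kept, not expanded
                continue
            for nxt in successors.get(path[-1], []):
                if nxt not in path:  # assumption: no vertex is visited twice
                    expanded.append(path + [nxt])
        if not expanded:
            break
        # Keep only the top `beam_size` candidates for the next iteration.
        candidates = heapq.nlargest(beam_size, expanded, key=score)
    finished = [path for path in candidates if path[-1] == end]
    return max(finished, key=score) if finished else None


successors = {
    "<S>": ["Ukraine", "government"],
    "Ukraine": ["delays", "<E>"],
    "delays": ["announcement", "<E>"],
    "government": ["<E>"],
    "announcement": ["<E>"],
}
word_weight = {"Ukraine": 3, "delays": 2, "announcement": 2, "government": 1}


def toy_score(path):
    """Toy scoring function: sum of illustrative word weights."""
    return sum(word_weight.get(word, 0) for word in path)


best = beam_search(successors, toy_score, beam_size=2)
print(best)  # ['<S>', 'Ukraine', 'delays', 'announcement', '<E>']
```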

Settings
We use the standard HG test dataset to evaluate our model, which consists of 500 articles from DUC-04 task 1, where each article is provided with four reference headlines. In particular, we use the first 100 articles from DUC-07 as our development set. On average, there are 40 events per article in the two datasets. All the pre-processing steps, including POS tagging, lemmatization, dependency parsing and anaphora resolution, are conducted using the Stanford NLP tools (Marneffe and Manning, 2008). The MRP iteration number is set to 10.
We use ROUGE (Lin, 2004) to automatically measure the model performance, which has been widely used in summarization tasks (Wang et al., 2013; Ng et al., 2014). We focus on ROUGE-1 and ROUGE-2 scores, following Xu et al. (2010). In addition, we conduct human evaluations, using the same method as Woodsend et al. (2010). Four participants are asked to rate the generated headlines by three criteria: informativeness (how much important information in the article does the headline describe?), fluency (is it fluent to read?) and coherence (does it capture the topic of the article?). Each headline is given a subjective score from 0 to 5, with 0 being the worst and 5 being the best. The first 50 documents from the test set and their corresponding headlines are selected for human rating. We conduct significance tests using the t-test.
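As a reminder of what the n-gram overlap scores measure, the following Python sketch computes a simplified n-gram recall against a single reference. It is not the official ROUGE toolkit: stemming, stopword handling, length limits and the treatment of multiple references are all omitted, and the function name `rouge_n_recall` is hypothetical.

```python
from collections import Counter


def rouge_n_recall(candidate, reference, n=1):
    """Simplified n-gram recall of a candidate headline against one reference."""

    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    candidate_counts, reference_counts = ngrams(candidate), ngrams(reference)
    total = sum(reference_counts.values())
    if total == 0:
        return 0.0
    # Clipped overlap: an n-gram is matched at most as often as the
    # candidate actually contains it.
    overlap = sum(
        min(count, candidate_counts[gram])
        for gram, count in reference_counts.items()
    )
    return overlap / total


reference = "ukraine delays announcement of new government".split()
candidate = "ukraine delays new government".split()
r1 = rouge_n_recall(candidate, reference, n=1)
r2 = rouge_n_recall(candidate, reference, n=2)
print(round(r1, 3), round(r2, 3))  # 0.667 0.4
```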

Development Results
There are three important parameters in the proposed event-driven model, namely the beam size $B$, the fluency weight $\lambda$ and the number of candidate events $N$. We find the optimum parameters on the development dataset in this section. For efficiency, the three parameters are optimized separately. The best performance is achieved with $B = 8$, $\lambda = 0.4$ and $N = 10$. We report the model results on the development dataset to study the influence of each of the three parameters, with the other two parameters set to their best values.

Influence of Beam Size
We perform experiments with different beam widths. Figure 6 shows the results of the proposed model with beam sizes of 1, 2, 4, 8, 16, 32, 64. As can be seen, our model can achieve the best performances when the beam size is set to 8. Larger beam sizes do not bring better results.

Influence of Fluency Weight
The fluency score is used for generating readable titles, while the edge score is used for generating informative titles. The balance between them is important. By default, we set the weight of the edge score to one, and find the best weight $\lambda$ for the fluency score. We vary $\lambda$ from 0 to 1 with an interval of 0.1, to investigate the influence of this parameter (footnote 8). Figure 7 shows the results. The best result is obtained when $\lambda = 0.4$.

Influence of Candidate Event Count
Ideally, all the sentences of an original text should be considered in multi-sentence compression. But an excess of sentences would bring more noise. We suppose that the number of candidate events $N$ is important as well. To study its influence, we report the model results with different $N$, from 1 to 15 with an interval of 1. As shown in Figure 8, the performance increases significantly from 1 to 10, with no further gains when $N > 10$. The performance decreases drastically when $N$ ranges from 12 to 15.

Final Results
Table 1 shows the final results on the test dataset. The performances of the proposed event-driven model are shown by EventRank. In addition, we use our graph-based MSC model to [...] SentRank, PhraseRank and EventRank to denote their MSC method and our proposed MSC, respectively, applying them, respectively. As shown in Table 1, better performance is achieved by our MSC, demonstrating the effectiveness of our proposed MSC. Similarly, the event-driven model can achieve the best results.
We report results of previous state-of-the-art systems as well. SentRank+SSC denotes the result of Erkan and Radev (2004), which uses our SentRank and SSC to obtain the final title. Topiary denotes the result of Zajic et al. (2005), which is an early abstractive model. Woodsend denotes the result of Woodsend et al. (2010), which is an abstractive model using a quasi-synchronous grammar to generate a title. As shown in Table 1, MSC is significantly better than SSC, and our event-driven model achieves the best performance, compared with state-of-the-art systems.
Footnote 8: Preliminary results show that $\lambda$ is better below one.
Footnote 9: The mark * denotes approximate results, which are estimated from the figures in the published paper.
Following Alfonseca et al. (2013), we also conduct human evaluation. The results are shown in Table 2, by three aspects: informativeness, fluency and coherence. The overall tendency is similar to that of the automatic evaluation results, and the event-driven model achieves the best results.

Example Outputs
We show several representative examples of the proposed event-driven model, in comparison with the extractive and abstractive models.
The examples are shown in Table 3.
In the first example, the results of both SentRank and PhraseRank contain the redundant phrase "catastrophe Tuesday". The output of PhraseRank is less fluent compared with that of SentRank. The preposition "for" is not recovered by the headline generation system PhraseRank. In contrast, the output of EventRank is better, capturing the major event in the reference title.
In the second example, the outputs of all three systems lose the phrase "Ibero-American summit". SentRank gives different additional information compared with PhraseRank and EventRank. Overall, the three outputs can be regarded as comparable. PhraseRank also has a fluency problem by ignoring some function words.
In the third example, SentRank does not capture the information on "demands for talks". PhraseRank discards the preposition word "for". The output of EventRank is better, being both more fluent and more informative.
From the three examples, we can see that SentRank tends to generate more readable titles, but may lose some important information. PhraseRank tends to generate a title with more important words, but the fluency is relatively weak even with MSC. EventRank combines the advantages of both SentRank and PhraseRank, generating titles that contain more important events with complete structures. The observation verifies our hypothesis in the introduction: extractive models have the problem of low information coverage, and abstractive models have the problem of poor grammaticality. The event-driven method can alleviate both issues, since events offer a trade-off between sentences and phrases.

Related Work
Our event-driven model is different from traditional extractive (Dorr et al., 2003; Erkan and Radev, 2004; Alfonseca et al., 2013) and abstractive models (Zajic et al., 2005; Soricut and Marcu, 2007; Woodsend et al., 2010; Xu et al., 2010) in that events are used as the basic processing units instead of sentences and phrases. As mentioned above, events are a trade-off between sentences and phrases, avoiding both the sparsity problem and the lack of structure. In particular, our event-driven model can interact with sentences and phrases, and is thus a lightweight combination of the two traditional models.
The event-driven model is mainly inspired by Alfonseca et al. (2013), who exploit events for multi-document headline generation. They leverage titles of sub-documents for supervised training. In contrast, we generate a title for a single document using an unsupervised model. We use novel approaches for event ranking and title generation.
In recent years, sentence compression (Galanis and Androutsopoulos, 2010; Yoshikawa and Iida, 2012; Wang et al., 2013; Thadani, 2014) has received much attention. Some methods can be directly applied to multi-document summarization (Wang et al., 2013;. To our knowledge, few studies have explored applying them in headline generation. Multi-sentence compression based on word graphs was first proposed by Filippova (2010). Some subsequent work has been presented recently. Boudin and Morin (2013) propose that key phrases are helpful for sentence generation. In their work, key phrases are extracted according to syntactic patterns and used to identify the shortest path. Mehdad et al. (2013) and Mehdad et al. (2014) introduce word-graph-based MSC into meeting summarization. Tzouridis et al. (2014) cast multi-sentence compression as a structured prediction problem. They use a large-margin approach to adapt parameterised edge weights to the data in order to acquire the shortest path. In their work, the sentences introduced to a word graph are treated equally, and the edges in the graph are constructed according to the adjacent order in the original sentence.
Our MSC model is also inspired by Filippova (2010).
Our approach is more aggressive than their approach, generating compressions with arbitrary length by using a different edge construction strategy. In addition, our search algorithm is also different from theirs. Our graph-based MSC model is also similar in spirit to sentence fusion, which has been used for multi-document summarization (Barzilay and McKeown, 2005;Elsner and Santhanam, 2011).

Conclusion and Future Work
We proposed an event-driven model for headline generation, introducing a graph-based MSC model to generate the final title, based on a set of events. Our event-driven model can incorporate sentence and phrase salience, which has been used in extractive and abstractive HG models. The proposed graph-based MSC model is not limited to our event-driven model. It can be applied to extractive and abstractive models as well. Experimental results on DUC-04 demonstrate that the event-driven model can achieve better results than extractive and abstractive models, and that the proposed graph-based MSC model can bring improved performance compared with previous MSC techniques. Our final event-driven model obtains the best result on this dataset.
For future work, we plan to explore two directions. Firstly, we plan to introduce event relations into learning event salience. In addition, we plan to investigate other methods for multi-sentence compression and sentence fusion, such as supervised methods.