Improving ROUGE for Timeline Summarization

Current evaluation metrics for timeline summarization either ignore the temporal aspect of the task or require strict date matching. We introduce variants of ROUGE that allow alignment of daily summaries via temporal distance or semantic similarity. We argue for the suitability of these variants in a theoretical analysis and demonstrate it in a battery of task-specific tests.


Introduction
There is an abundance of reports on events, crises and disasters. Timelines summarize and date these reports in an ordered overview to combat information overload.
Table 1: Parts of two journalist-generated timelines for the BP oil spill.

First timeline
2010-05-06  BP tries to stop the spill by lowering a 98-ton "containment dome" over the leak. The effort eventually fails, as crystallized gases cause the containment dome to become unexpectedly buoyant.
2010-05-26  BP begins "top kill" attempt, shooting mud down the drillpipe in an attempt to clog the leaking well. After several days, the effort is abandoned.
2010-05-27  President Obama announces a six-month moratorium on new deepwater drilling in the gulf.

Second timeline
2010-05-14  Then-BP CEO Tony Hayward tells reporters that the amount of oil spilled is relatively small given the Gulf of Mexico's size.
2010-05-28  Hayward says the "top kill" effort to plug the well is progressing as planned and had a 60 to 70 percent chance of success, the same odds he gave before the maneuver. The next day the company announces that the effort failed.

Table 1 shows parts of journalist-generated timelines. Approaches for automatic timeline summarization (TLS) use such edited timelines as reference timelines to gauge their performance (Chieu and Lee, 2004; Yan et al., 2011b; Tran et al., 2013; Wang et al., 2016). For evaluation, most research uses the standard summarization evaluation metric ROUGE (Lin, 2004) without respecting the specific characteristics of TLS.
In this paper, we identify weaknesses of currently used evaluation metrics for TLS. We devise new variants of ROUGE to overcome these weaknesses and show the suitability of the variants with a theoretical and empirical analysis. A toolkit that implements our metrics is available for download as open source.

Task Description and Notation
Given a query (such as BP oil spill), TLS needs to (i) extract the most important events for the query and their corresponding dates and (ii) obtain concise daily summaries for each selected date (Allan et al., 2001; Chieu and Lee, 2004; Yan et al., 2011b; Tran et al., 2015; Wang et al., 2016).
Formally, a timeline is a sequence $(d_1, s_1), \ldots, (d_k, s_k)$ where the $d_i$ are dates and the $s_i$ are summaries for the dates $d_i$. Given are a query $q$ and an associated corpus $C_q$ that contains documents relevant to the query. The task of timeline summarization is to generate a timeline $s_q$ based on the documents in $C_q$. The number of dates in the generated timeline as well as the length of the daily summaries are typically controlled by the user. For evaluation we assume access to one or more reference timelines $R_q = \{r^q_1, \ldots, r^q_{n_q}\}$. In our notation we usually drop the query sub-/superscript.
For a timeline $t$, $D_t$ denotes the set of days in $t$. For a set of timelines $T$, we set $D_T = \bigcup_{t \in T} D_t$.

Current Evaluation Metrics
We now describe evaluation metrics for TLS and related tasks.

ROUGE
Most work on TLS adopts the ROUGE toolkit that is used for standard summarization evaluation (Lin, 2004). ROUGE metrics evaluate a system summary s of one or more texts against a set R of reference summaries (without accounting for the dates of summaries). The most popular variants of ROUGE are the ROUGE-N metrics, which measure the overlap of N-grams in system and reference summaries. Several ROUGE metrics are well correlated with human judgment (Graham, 2015).
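For a summary $c$, let $\mathrm{ng}(c)$ denote the set of $c$'s N-grams, and let $\mathrm{cnt}_c(g)$ be the number of occurrences of an N-gram $g$ in $c$. For two summaries $c_1$ and $c_2$, $\mathrm{cnt}_{c_1,c_2}(g) = \min\{\mathrm{cnt}_{c_1}(g), \mathrm{cnt}_{c_2}(g)\}$ is the minimum number of occurrences of $g$ in both $c_1$ and $c_2$.
ROUGE-N recall is then defined as
\[ \mathrm{rec}(R, s) = \frac{\sum_{r \in R} \sum_{g \in \mathrm{ng}(r)} \mathrm{cnt}_{r,s}(g)}{\sum_{r \in R} \sum_{g \in \mathrm{ng}(r)} \mathrm{cnt}_r(g)}, \tag{1} \]
while ROUGE-N precision is defined as
\[ \mathrm{prec}(R, s) = \frac{\sum_{r \in R} \sum_{g \in \mathrm{ng}(s)} \mathrm{cnt}_{r,s}(g)}{|R| \sum_{g \in \mathrm{ng}(s)} \mathrm{cnt}_s(g)}. \tag{2} \]
ROUGE-N $F_1$ is the harmonic mean of recall and precision.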
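The definitions above can be illustrated with a minimal Python sketch. It assumes lowercased whitespace tokenization (the ROUGE toolkit applies its own preprocessing) and assumes that precision is normalized by the number of references times the system N-gram count; the function names `ngrams`, `overlap` and `rouge_n` are hypothetical and not taken from any toolkit.

```python
from collections import Counter


def ngrams(text, n=1):
    """Multiset of word N-grams of a summary (assumed: lowercase, whitespace tokens)."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def overlap(c1, c2):
    """Sum over N-grams g of min(cnt_c1(g), cnt_c2(g)); Counter '&' takes the minimum."""
    return sum((c1 & c2).values())


def rouge_n(references, system, n=1):
    """Return (recall, precision, F1) of one system summary against reference summaries."""
    sys_counts = ngrams(system, n)
    ref_counts = [ngrams(r, n) for r in references]
    matched = sum(overlap(r, sys_counts) for r in ref_counts)
    ref_total = sum(sum(r.values()) for r in ref_counts)
    # Assumption: the system N-gram count is counted once per reference.
    sys_total = len(ref_counts) * sum(sys_counts.values())
    recall = matched / ref_total if ref_total else 0.0
    precision = matched / sys_total if sys_total else 0.0
    f1 = 2 * recall * precision / (recall + precision) if recall + precision else 0.0
    return recall, precision, f1


if __name__ == "__main__":
    print(rouge_n(["the cat sat"], "the cat ran"))
```

In the usage line, two of the three unigrams match, so recall, precision and F1 all equal 2/3.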
Concatenation-based ROUGE. The simplest and most popular way to apply ROUGE to TLS, which we refer to as concat, is to run ROUGE on documents obtained by concatenating the items of the timelines (Takamura et al., 2011; Yan et al., 2011a; Nguyen et al., 2014; Wang et al., 2016). Given a timeline $t = (d_1, s_1), \ldots, (d_k, s_k)$, we concatenate the $s_i$, which yields a single document in which all date information is lost. We apply this transformation to the reference and the system timelines and run ROUGE on the resulting documents. Because this method discards all temporal information, different datings of the same event are not penalized. Most work does not address this issue at all. An exception is Takamura et al. (2011), who ignore word matches when the matched word appears only in a summary where the time difference exceeds a pre-specified constant. However, it is left open how to set this constant, and different datings of the same event below the threshold difference would again receive no penalty.
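Date-agreement ROUGE. A more principled method of accounting for temporal information is to evaluate the quality of the summary for each day individually (Tran et al., 2013; Wang et al., 2015). We refer to this method as agreement. For a date $d$, a set of reference timelines $R$ and a system timeline $s$, we set $R(d)$ to the set of summaries for $d$ in $R$. $R(d)$ can be empty if the date is not included in any timeline. $s(d)$ is the (possibly empty) summary of $d$ in $s$. We define recall for a date $d$ as
\[ \mathrm{rec}(d, R, s) = \frac{\sum_{r \in R(d)} \sum_{g \in \mathrm{ng}(r)} \mathrm{cnt}_{r,s(d)}(g)}{\sum_{r \in R(d)} \sum_{g \in \mathrm{ng}(r)} \mathrm{cnt}_r(g)}. \tag{3} \]
$\mathrm{rec}(d, R, s)$ can be extended to the set of dates $D_R$, typically by micro-averaging, that is
\[ \mathrm{rec}(R, s) = \frac{\sum_{d \in D_R} \sum_{r \in R(d)} \sum_{g \in \mathrm{ng}(r)} \mathrm{cnt}_{r,s(d)}(g)}{\sum_{d \in D_R} \sum_{r \in R(d)} \sum_{g \in \mathrm{ng}(r)} \mathrm{cnt}_r(g)}. \tag{4} \]
The handling of precision is analogous: instead of the formula for ROUGE recall we use the formula for ROUGE precision and average with respect to $D_s$ instead of $D_R$.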
While this metric accounts for temporal information, it requires that dates in reference and generated timelines match exactly. Otherwise, a score of 0 is assigned. For example, in the BP oil spill example in Table 1, the first timeline would get a score of 0 when compared with the second timeline, even though both timelines report the "top kill" effort and its later failure, albeit on different dates. This effect can be particularly problematic for longer-lasting events.
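The contrast between concat and agreement can be made concrete with a small sketch. It assumes timelines are represented as dictionaries from `datetime.date` to summary text and uses unigram recall with whitespace tokenization; the helper names and the two shortened summaries are illustrative assumptions, not data from the paper.

```python
from collections import Counter
from datetime import date


def ngrams(text, n=1):
    """Multiset of word N-grams (assumed: lowercase, whitespace tokens)."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def concat_recall(references, system, n=1):
    """concat: drop all dates, join the daily summaries, then compute recall."""
    sys_counts = ngrams(" ".join(system[d] for d in sorted(system)), n)
    matched = total = 0
    for ref in references:
        ref_counts = ngrams(" ".join(ref[d] for d in sorted(ref)), n)
        matched += sum((ref_counts & sys_counts).values())
        total += sum(ref_counts.values())
    return matched / total if total else 0.0


def agreement_recall(references, system, n=1):
    """agreement: micro-averaged recall, comparing only summaries with identical dates."""
    matched = total = 0
    for ref in references:
        for d, summary in ref.items():
            ref_counts = ngrams(summary, n)
            sys_counts = ngrams(system.get(d, ""), n)  # empty summary if date is missing
            matched += sum((ref_counts & sys_counts).values())
            total += sum(ref_counts.values())
    return matched / total if total else 0.0


if __name__ == "__main__":
    reference = {date(2010, 5, 28): "top kill effort failed"}
    system = {date(2010, 5, 26): "top kill effort abandoned"}
    print(concat_recall([reference], system))     # date mismatch is not penalized
    print(agreement_recall([reference], system))  # date mismatch gives zero credit
```

On this toy pair, concat credits three of four reference unigrams despite the two-day difference, whereas agreement assigns 0 because the dates do not match exactly.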

Other Metrics
Some work evaluates TLS manually (Chieu and Lee, 2004;Tran et al., 2015). However, such evaluation is costly.
A related task to TLS is the TREC update summarization task (Aslam et al., 2015). In contrast to TLS, this task requires online summarization by presenting the input as a stream of documents. The metric employed relies on manually matching sentences of reference and system timelines. Kedzie et al. (2015) modify TREC metrics for a fully automatic setting, but still need a manually optimized threshold for establishing semantic matching. Moreover, the matching is binary: two summaries either match or do not match. The metric does not incorporate information about the degree of similarity between two summaries.
Lastly, in the DUC 2007 and TAC 2008-2011 evaluation campaigns a different type of update summarization was evaluated: the objective was to create and then update a multi-document summary with new information (see, e.g., Owczarzak and Dang (2011)). This task differs fundamentally from TLS and TREC-style update summarization, since no individual summaries for dates have to be created. Evaluation metrics specifically designed for the task employ a combination of ROUGE scores to simultaneously reward similarity to human-generated summaries and penalize redundancy with respect to the original machine-generated summary (Conroy et al., 2011).

Alignment-based ROUGE
From the analysis in the previous section we see that a metric for TLS should take temporal and semantic similarity of daily summaries into account, while not requiring an exact match between days.
We now propose variants of ROUGE that fulfill this desideratum. The main idea is that daily summaries that are close in time and that describe the same event or very similar events should be compared for evaluation. For example, the daily summaries that report on the "top kill" effort in the example in Table 1 should be compared. To do so, we first align dates in system and reference timelines (see footnote 4). ROUGE scores are then computed for the summaries of the aligned dates.

Formal Definition
Let $R$ be a set of reference timelines and let $s$ be a system timeline. The proposed alignment-based ROUGE recall relies on a mapping
\[ f : D_R \to D_s \]
that assigns each date $d_r \in D_R$ in some reference timeline a date $d_s \in D_s$ in the system timeline. For evaluation, the summaries for the aligned dates are compared (see footnote 5).
In order to penalize date differences when comparing summaries, each date pair $(d_r, d_s) \in D_R \times D_s$ is associated with a weighting factor $t_{d_r,d_s}$. In this paper, we only consider the weighting factor
\[ t_{d_r,d_s} = \frac{1}{|d_r - d_s| + 1}, \]
where $d_r - d_s$ is the difference between $d_r$ and $d_s$ in number of days. Given some alignment $f$, alignment-based ROUGE recall $\mathrm{rec}(R, s, f)$ is then defined as
\[ \mathrm{rec}(R, s, f) = \frac{\sum_{d_r \in D_R} t_{d_r,f(d_r)} \sum_{r \in R(d_r)} \sum_{g \in \mathrm{ng}(r)} \mathrm{cnt}_{r,s(f(d_r))}(g)}{\sum_{d_r \in D_R} \sum_{r \in R(d_r)} \sum_{g \in \mathrm{ng}(r)} \mathrm{cnt}_r(g)}. \tag{7} \]
Footnote 4: We are inspired by Luo (2005), who devises an alignment-based metric for coreference resolution.
Footnote 5: We only discuss how recall is computed. For computing precision we instead consider alignments $f : D_s \to D_R$ and apply the corresponding formulas for precision as discussed in Section 3.

Computing Alignments
For computing alignments, we associate with every date pair $(d_r, d_s) \in D_R \times D_s$ another value, which is the cost $c_{d_r,d_s}$ of assigning $d_r$ to $d_s$. We will study costs that depend on date distance and/or semantic similarity of the corresponding summaries. The goal is to find a mapping $f^* : D_R \to D_s$ that minimizes the sum of the costs, i.e.
\[ f^* = \operatorname*{arg\,min}_{f} \sum_{d_r \in D_R} c_{d_r,f(d_r)}. \]

Instantiations
We consider three instantiations of the alignment problem presented above. They vary in the cost function and with respect to constraints on the alignment.
Date Alignment. For the first instantiation, which we call date alignment or align, the cost only depends on date distance, ignoring semantic similarity. We set
\[ c_{d_r,d_s} = 1 - \frac{1}{|d_r - d_s| + 1}. \]
We require that the alignment is injective (see footnote 6). In Table 1, for example, the daily summaries for 2010-05-27 and 2010-05-28 would be aligned.
Footnote 6: If $|D_R| > |D_s|$, some $d_r \in D_R$ will be unaligned. For these dates we set the n-gram counts to 0 in the numerator of Equation 7.
Date-content Alignment. The second instantiation, date-content alignment or align+, also includes semantic similarity in the costs. Semantic similarity is approximated by the ROUGE-1 $F_1$ score between two daily summaries. We set
\[ c_{d_r,d_s} = \left(1 - \frac{1}{|d_r - d_s| + 1}\right) \cdot \left(1 - \mathrm{R1}(d_r, d_s)\right), \]
where $\mathrm{R1}(d_r, d_s)$ is the ROUGE-1 $F_1$ score that compares the reference summaries for date $d_r$ with the system summary for date $d_s$. Here, too, we require that the alignment is injective. The two daily summaries referring to the "top kill" effort in Table 1 would be aligned when this metric is employed.
Many-to-one Date-content Alignment. For our last metric (many-to-one date-content alignment or align+ m:1) we drop the injectivity requirement from align+.
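To make the ingredients of the three instantiations concrete, the following sketch computes the date weight, two candidate cost functions and the date-weighted recall for a given alignment. Several things here are assumptions made for illustration: the date cost is taken to be one minus the weighting factor, the date-content cost is taken to be that value scaled by one minus ROUGE-1 F1, only a single reference timeline and unigrams are handled, and all function names are hypothetical.

```python
from collections import Counter
from datetime import date


def unigrams(text):
    """Unigram multiset (assumed: lowercase, whitespace tokens)."""
    return Counter(text.lower().split())


def rouge1_f1(ref_text, sys_text):
    """ROUGE-1 F1 between one reference summary and one system summary."""
    ref, hyp = unigrams(ref_text), unigrams(sys_text)
    matched = sum((ref & hyp).values())
    if matched == 0:
        return 0.0
    rec, prec = matched / sum(ref.values()), matched / sum(hyp.values())
    return 2 * rec * prec / (rec + prec)


def date_weight(d_r, d_s):
    """Weighting factor t = 1 / (|d_r - d_s| + 1), with the difference in days."""
    return 1 / (abs((d_r - d_s).days) + 1)


def date_cost(d_r, d_s):
    """Assumed cost for align: 1 - t, which is 0 for identical dates."""
    return 1 - date_weight(d_r, d_s)


def date_content_cost(d_r, d_s, ref, system):
    """Assumed cost for align+: date cost scaled by (1 - ROUGE-1 F1)."""
    return date_cost(d_r, d_s) * (1 - rouge1_f1(ref[d_r], system[d_s]))


def aligned_recall(ref, system, alignment):
    """Date-weighted unigram recall for one reference timeline and a given alignment."""
    matched, total = 0.0, 0
    for d_r, summary in ref.items():
        ref_counts = unigrams(summary)
        total += sum(ref_counts.values())
        d_s = alignment.get(d_r)
        if d_s is None:  # unaligned reference dates contribute nothing
            continue
        sys_counts = unigrams(system[d_s])
        matched += date_weight(d_r, d_s) * sum((ref_counts & sys_counts).values())
    return matched / total if total else 0.0


if __name__ == "__main__":
    ref = {date(2010, 5, 28): "top kill effort failed"}
    system = {date(2010, 5, 26): "top kill effort abandoned"}
    alignment = {date(2010, 5, 28): date(2010, 5, 26)}
    print(aligned_recall(ref, system, alignment))
```

In the usage example the two-day gap gives a weight of 1/3, so the three matching unigrams contribute 1 in total and recall is 0.25, instead of the 0 that exact date matching would assign.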

Discussion
Complexity. If we require that $f^*$ is injective, as in align and align+, we face a linear assignment problem, for which polynomial-time algorithms exist (Kuhn, 1955). The optimal assignment for align+ m:1 can be computed by a simple greedy algorithm: for every date in $D_R$ we choose the date in $D_s$ such that the cost is minimal.
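The difference between the two settings can be sketched on a small cost matrix whose rows are reference dates and whose columns are system dates. The greedy function follows the description above. The injective solver below is deliberately a brute-force search over permutations, a factorial-time stand-in for a polynomial-time assignment algorithm that is only practical for tiny inputs; in practice one might use an existing solver such as `scipy.optimize.linear_sum_assignment`. Treating surplus reference dates as zero-cost "unaligned" slots is an assumption of this sketch, and both function names are hypothetical.

```python
from itertools import permutations


def greedy_many_to_one(costs):
    """For each reference date (row), pick the system date (column) with minimal cost."""
    return [min(range(len(row)), key=row.__getitem__) for row in costs]


def brute_force_injective(costs):
    """Minimum-cost injective assignment of rows to columns by exhaustive search.

    If there are more rows than columns, the surplus rows are mapped to None.
    """
    n_rows = len(costs)
    n_cols = len(costs[0]) if costs else 0
    # Slots beyond n_cols act as zero-cost dummies meaning "unaligned" (assumption).
    n_slots = max(n_rows, n_cols)
    best, best_cost = [], float("inf")
    for perm in permutations(range(n_slots), n_rows):
        total = sum(costs[i][j] for i, j in enumerate(perm) if j < n_cols)
        if total < best_cost:
            best, best_cost = list(perm), total
    return [j if j < n_cols else None for j in best]


if __name__ == "__main__":
    costs = [[0.1, 0.5],
             [0.2, 0.9]]
    print(greedy_many_to_one(costs))     # both rows may share column 0
    print(brute_force_injective(costs))  # each column is used at most once
```

On this matrix the greedy many-to-one solution maps both reference dates to the first system date, while the injective solution has to send the first reference date to the second system date to keep the total cost minimal.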
Generalizing agreement. Note that agreement, which relies on exact date match, also fits in our framework: we require $f^*$ to be injective and set $t_{d_r,d_s} = 1$, $c_{d_r,d_s} = 0$ iff $d_r = d_s$, and $t_{d_r,d_s} = 0$, $c_{d_r,d_s} = \infty$ otherwise for all $(d_r, d_s) \in D_R \times D_s$.
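This reduction can be checked by substitution. The sketch below assumes that alignment-based recall weights the N-gram overlap of each aligned date pair by $t$ and normalizes by the total reference N-gram count; that form is restated here as an assumption, and the derivation is a worked illustration rather than part of the original argument.

```latex
% Assumed form of alignment-based recall: weighted overlap over total reference counts.
\[
\mathrm{rec}(R, s, f) =
  \frac{\sum_{d_r \in D_R} t_{d_r,f(d_r)} \sum_{r \in R(d_r)} \sum_{g \in \mathrm{ng}(r)} \mathrm{cnt}_{r,s(f(d_r))}(g)}
       {\sum_{d_r \in D_R} \sum_{r \in R(d_r)} \sum_{g \in \mathrm{ng}(r)} \mathrm{cnt}_r(g)}
\]
% With the agreement setting, a pair has finite cost and weight t = 1 only if d_r = d_s.
% Every other pair has t = 0 and drops out of the numerator:
\[
\mathrm{rec}(R, s, f^*) =
  \frac{\sum_{d_r \in D_R \cap D_s} \sum_{r \in R(d_r)} \sum_{g \in \mathrm{ng}(r)} \mathrm{cnt}_{r,s(d_r)}(g)}
       {\sum_{d_r \in D_R} \sum_{r \in R(d_r)} \sum_{g \in \mathrm{ng}(r)} \mathrm{cnt}_r(g)}
\]
```

Since $s(d)$ is empty for every $d \notin D_s$, extending the sum in the numerator from $D_R \cap D_s$ to all of $D_R$ adds only zero terms, which gives the micro-averaged date-agreement recall.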

Tests for Metrics
An evaluation metric should behave as expected when task-specific operations are performed on output (Moosavi and Strube, 2016). For example, in TLS, removing a date (and its summary) from a reference timeline should decrease recall when comparing the timeline to itself. A metric cannot be suitable if it does not pass such tests.
We now devise and evaluate tests for the metrics discussed in this paper. Eventually, metrics that pass the tests should be checked for correlation with human judgment. We defer such an experiment to future work.

Test Definitions
We derive tests that examine whether well-defined basic operations on reference timelines affect the metrics as expected. An example is the date removal operation described above. Other basic operations are date addition, merging and shifting. In order to have a controlled environment we apply all operations to copies of reference timelines. Comparing a reference timeline to itself gives precision, recall and $F_1$ score of 1. Comparing a modified version to the original timeline should decrease precision and/or recall, depending on the operation. We apply the following operations:
• Remove: remove a random date and its summary. Precision should stay 1, recall should decrease.
• Add: for the first date not in the reference timeline, add a summary consisting of the first sentence of the first article of that day from the associated corpus. Precision should decrease, recall should stay 1.
• Merge: merge summaries of the closest pair of dates, breaking ties by temporal order. Precision and recall should decrease slightly.
• Shift k days: shift each day by k days to the future. Precision and recall should decrease. The drop should increase as k increases.
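Three of these operations can be sketched directly on a timeline represented as a dictionary from `datetime.date` to summary text. The representation, the function names, and the choice to attach a merged summary to the earlier of the two dates are assumptions of this sketch. The Add operation is omitted because it needs the associated corpus.

```python
import random
from datetime import date, timedelta


def remove_random_date(timeline, rng=random):
    """Remove: drop one randomly chosen date together with its summary."""
    victim = rng.choice(sorted(timeline))
    return {d: s for d, s in timeline.items() if d != victim}


def merge_closest(timeline):
    """Merge: join the summaries of the closest pair of dates (needs >= 2 dates).

    Ties are broken by temporal order; the merged summary is kept on the
    earlier date (an assumption, the text does not specify this).
    """
    dates = sorted(timeline)
    # min() returns the first minimal gap, i.e. the earliest closest pair.
    i = min(range(len(dates) - 1), key=lambda j: (dates[j + 1] - dates[j]).days)
    merged = dict(timeline)
    merged[dates[i]] = merged[dates[i]] + " " + merged.pop(dates[i + 1])
    return merged


def shift(timeline, k):
    """Shift k days: move every date k days into the future."""
    return {d + timedelta(days=k): s for d, s in timeline.items()}


if __name__ == "__main__":
    timeline = {
        date(2010, 5, 6): "containment dome fails",
        date(2010, 5, 26): "top kill attempt begins",
        date(2010, 5, 27): "drilling moratorium announced",
    }
    print(sorted(remove_random_date(timeline, random.Random(0))))
    print(merge_closest(timeline))
    print(sorted(shift(timeline, 1)))
```

Each function returns a modified copy and leaves the input untouched, which matches the requirement that operations are applied to copies of the reference timelines.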

Evaluation
We run the proposed tests on the publicly available timeline17 data set (Tran et al., 2013), which contains 17 timelines across nine topics and associated corpora. We apply each operation to each timeline. We then compare each modified timeline to the corresponding original timeline. We evaluate using variants based on ROUGE-1 and ROUGE-2, which are the most popular ROUGE-N metrics for evaluating TLS. Table 2 shows averaged results over all timelines for ROUGE-1 (ROUGE-2 yielded similar results).
We find that the frequently used concat is not a suitable metric for TLS. It is insensitive to merging and date shifting as it does not respect temporal information. The agreement metric shows the expected behavior for all tests but, due to the required exact date matching, suffers a very large drop for even minor date shifting and does not differentiate well between shifting one day and shifting five days.
Table 2: Tests on timeline17. Numbers are differences from 1 according to ROUGE-1-based metrics.
The alignment-based metrics show the most desirable behavior according to our criteria: they pass all tests and the drops caused by shifts are lower and differentiation is better than for agreement. For the other tests, these metrics behave similarly to agreement. Including semantic similarity (align+) further decreases drops in date shifting. Except for the Shift 1 day test, many-to-one alignments (align+ m:1) yield the most lenient results of all alignment-based metrics.

Conclusions and Future Work
Current evaluation metrics for TLS are not suitable. In a formal and empirical analysis we identified weaknesses of metrics encountered in the literature. We devised a family of alignment-based ROUGE variants tailored to TLS. We found that these metrics exhibit the desired behavior when applying a battery of task-specific tests.
In future work we will study the correlation of TLS metrics with human judgment. In order to optimize correlation, we will also investigate more content and date similarity measures for computing and weighting optimal alignments.