Examining the State-of-the-Art in News Timeline Summarization

Previous work on automatic news timeline summarization (TLS) leaves an unclear picture about how this task can generally be approached and how well it is currently solved. This is mostly due to the focus on individual subtasks, such as date selection and date summarization, and to the previous lack of appropriate evaluation metrics for the full TLS task. In this paper, we compare different TLS strategies using appropriate evaluation frameworks, and propose a simple and effective combination of methods that improves over the state-of-the-art on all tested benchmarks. For a more robust evaluation, we also present a new TLS dataset, which is larger and spans longer time periods than previous datasets.


Introduction
Timelines of news events can be useful to condense long-ranging news topics and can help us understand how current major events follow from prior events. Timeline summarization (TLS) aims to automatically create such timelines, i.e., temporally ordered time-stamped textual summaries of events focused on a particular topic. While TLS has been studied before, most works treat it as a combination of two individual subtasks, 1) date selection and 2) date summarization, and only focus on one of these at a time (Tran et al., 2013a,b, 2015b). However, these subtasks are almost never evaluated in combination, which leaves an unclear picture of how well TLS is being solved in general. Furthermore, previously used evaluation metrics for the date selection and timeline summarization tasks are not appropriate since they do not consider the temporal alignment in the evaluation. Until recently, there were no established experimental settings and appropriate metrics for the full TLS task (Martschat and Markert, 2017, 2018).
Table 1: Excerpt of an automatically constructed timeline about the company Enron, using article headlines as summaries. The shaded parts indicate that the date or summary matches entries in a corresponding human-written ground-truth timeline.
In this paper, we examine existing strategies for the full TLS task and how well they actually work. We identify three high-level approaches: 1) Direct summarization treats TLS like text summarization, e.g., by selecting a small subset of sentences from a massive collection of news articles; 2) The date-wise approach first selects salient dates and then summarizes each date; 3) Event detection first detects events, e.g., via clustering, selects salient events and summarizes these individually. The current state-of-the-art method is based on direct summarization (Martschat and Markert, 2018). We therefore focus on testing the two remaining strategies, which have not been appropriately evaluated yet and allow for better scalability.
We propose a simple method to improve date summarization for the date-wise approach. The method uses temporal expressions (textual references to dates) to derive date vectors, which in turn help to filter candidate sentences to summarize particular dates. With this modification, the date-wise approach obtains improved state-of-the-

Date Selection
Supervised machine learning has been proposed to predict whether dates appear in ground-truth timelines (Kessler et al., 2012; Tran et al., 2013a). Tran et al. (2015b) use graph-based ranking of dates, which is reported to outperform supervised methods.

Date Summarization
Several approaches construct date summaries by picking sentences from ranked lists. The ranking is based on regression or learning-to-rank to predict ROUGE scores between the sentence and a ground-truth summary (Tran et al., 2013a,b). Tran et al. (2015a) observe that users prefer summaries consisting of headlines to summaries consisting of sentences from article bodies. Steen and Markert (2019) propose abstractive date summarization based on graph-based sentence merging and compression. Other works propose the use of additional data, such as comments on social media (Wang et al., 2015), or images (Wang et al., 2016).

Full Timeline Summarization
Chieu and Lee (2004) produce timelines by ranking sentences from an entire document collection. The ranking is based on summed up similarities to other sentences in an n-day window. Nguyen et al. (2014) propose a pipeline to generate timelines consisting of date selection, sentence clustering, and ranking. Martschat and Markert (2018) adapt submodular function optimization, commonly used for multi-document summarization, for the TLS task. The approach searches for a combination of sentences from a whole document collection to construct a timeline and is the current state-of-the-art for full TLS. Steen and Markert (2019) use a two-stage approach consisting of date selection and date summarization to build timelines. Other examples of automatic timeline generation can be found in the social media-related literature, where microblogs are often clustered before being summarized (Wang et al., 2014; Li and Cardie, 2014). We explore a similar framework for evaluating clustering-based TLS.

Strategies for Timeline Summarization

Problem Definition
We define the TLS setup and task as follows. We are given a set of news articles A, a set of query keyphrases Q, and a ground-truth (reference) timeline r, with l dates that are associated with k sentences on average, i.e., m = k * l sentences in total. The task is to construct a (system) timeline s that contains m sentences, assigned to an arbitrary number of dates. A simpler and stricter setting can also be used, in which s must contain exactly l dates with k sentences each.

Approach Types
A number of different high-level approaches can be used to tackle this task:
1. Direct Summarization: A is treated as one set of sentences, from which a timeline is directly extracted, e.g., by optimizing a sentence combination (Martschat and Markert, 2018), or by sentence ranking (Chieu and Lee, 2004). Among these, the solution of Martschat and Markert (2018) for the full TLS task has state-of-the-art accuracy but does not scale well.
2. Date-wise Approach: This approach selects l dates and then constructs a text summary of k sentences on average for each date.
3. Event Detection: This approach first detects events in A, e.g., by clustering similar articles, and then identifies the l most important events and summarizes these separately.
Since no prior work has analyzed the latter two categories for the full TLS task, we discuss and develop such approaches next.

Date-wise Approach
The approach described here mostly consists of existing building blocks, with a few small but important modifications of our own.

Defining the Set of Dates
First, we identify the set of possible dates to include in a timeline. We obtain these from (i) the publication dates of all articles in A and (ii) textual references to dates in sentences in A, such as 'last Monday' or '12 April'. We use the tool HeidelTime (Strötgen and Gertz, 2013) to detect and resolve textual mentions of dates.
• MENTIONCOUNT: Ranking dates by the number of sentences that mention the date.
• SUPERVISED: Extracting date features and using classification or regression to predict whether a date appears in a ground-truth timeline. These features mostly include the publication count and different variants of counting date mentions.
Our experiments show that SUPERVISED works best, closely followed by MENTIONCOUNT (Appendix A.1). Figure 1 shows an example of publication and date mention counts and ground-truth dates over time. Two challenges for date selection methods are evident: 1) these count signals usually do not perfectly correlate with ground-truth dates, and 2) high values often cluster around important dates, i.e., a "correct" date is often surrounded by other, "incorrect" dates with similarly strong signals.
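As an illustration of the MENTIONCOUNT signal, the following minimal sketch ranks dates by the number of sentences that mention them. The input format (one collection of already resolved dates per sentence), the function name, and the tie-breaking rule are assumptions made for this example; they are not taken from the implementation used in the experiments.

```python
from collections import Counter
from datetime import date


def rank_dates_by_mention_count(sentence_date_mentions):
    """Rank dates by the number of sentences that mention them.

    sentence_date_mentions: one iterable of resolved dates per sentence
    (assumed to come from a temporal tagger such as HeidelTime).
    """
    counts = Counter()
    for mentioned in sentence_date_mentions:
        # Count each date at most once per sentence.
        counts.update(set(mentioned))
    # Descending count; ties are broken chronologically (an assumption).
    return sorted(counts, key=lambda d: (-counts[d], d))


# Toy input: five sentences with their resolved date mentions.
mentions = [
    [date(2001, 12, 2)],
    [date(2001, 12, 2), date(2001, 10, 16)],
    [date(2001, 10, 16)],
    [date(2001, 12, 2)],
    [],
]
ranked = rank_dates_by_mention_count(mentions)
```

Picking the l highest-ranked entries of such a list gives the selected dates; the clustering of high counts around important dates noted above is visible in any ranking of this kind.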

Candidate Sentences for Dates
To summarize a particular date d, we first need to decide which articles or sentences to use as the source for the summary. Previous research has not explored this aspect much due to the separate treatment of subtasks. We propose a simple but effective heuristic to do this. We consider the following two sets to be the primary source of suitable candidate sentences:
• $P_d$: Sentences published on or closely after d. These often contain initial reports of events occurring on d.
• $M_d$: Sentences that mention d. These sentences are from articles published at any point in time, and may retrospectively refer to d, or announce events on d beforehand.
We evaluate these two options in our experiments, and propose a heuristic that combines them, which we call PM-MEAN. We aim to find a subset of sentences in $P_d \cup M_d$ that are likely to mention important events happening on d. We convert all the sentences in the collection A to sparse bag-of-words (unigram) vectors with sentence-level TF-IDF weighting. We represent the sets of sentences $P_d$ and $M_d$ using the mean of their respective sentence vectors, $x_{P_d}$ and $x_{M_d}$. The core assumption of the method is that the content shared between $P_d$ and $M_d$ is a good source for summarizing events on d. To capture this content, we build a date vector $x_d$, so that we can compare sentence vectors against it to rank sentences. We set the value of $x_d$ for each dimension i in the feature space as follows:
$$x_d^i = \begin{cases} \dfrac{|P_d|\, x_{P_d}^i + |M_d|\, x_{M_d}^i}{|P_d| + |M_d|} & \text{if } x_{P_d}^i > 0 \text{ and } x_{M_d}^i > 0 \\ 0 & \text{otherwise} \end{cases}$$
Thus the date vector $x_d$ is an average of $x_{P_d}$ and $x_{M_d}$ weighted by the sizes of $P_d$ and $M_d$, with any features zeroed out if they are missing in either $P_d$ or $M_d$. To rank sentences, we compute the cosine similarity between the vector $x_s$ of each candidate sentence $s \in (P_d \cup M_d)$ and $x_d$. We select the best-scoring candidate sentences by defining a threshold on this similarity. To avoid tuning this threshold, we use a simple knee point detection method (Satopaa et al., 2011) to dynamically identify a threshold that represents the "knee" (or elbow) in the similarity distribution. This set of best-scoring sentences is then used as the input for the final date summarization step.
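The PM-MEAN heuristic described above can be sketched as follows. This is a minimal illustration under stated assumptions: sentence vectors are given as sparse `{feature: weight}` dictionaries that already hold TF-IDF weights, all function names are hypothetical, and the knee-point thresholding step is left out, so the sketch stops at the ranked candidate list.

```python
import math


def mean_vector(vectors):
    """Mean of sparse vectors given as {feature: weight} dicts."""
    total = {}
    for vec in vectors:
        for feat, weight in vec.items():
            total[feat] = total.get(feat, 0.0) + weight
    n = len(vectors)
    return {feat: w / n for feat, w in total.items()} if n else {}


def date_vector(p_vectors, m_vectors):
    """Size-weighted average of both means, restricted to shared features."""
    x_p, x_m = mean_vector(p_vectors), mean_vector(m_vectors)
    n_p, n_m = len(p_vectors), len(m_vectors)
    return {
        feat: (n_p * x_p[feat] + n_m * x_m[feat]) / (n_p + n_m)
        for feat in x_p.keys() & x_m.keys()
        if x_p[feat] > 0 and x_m[feat] > 0
    }


def cosine(u, v):
    """Cosine similarity of two sparse vectors (0.0 if either is empty)."""
    dot = sum(w * v.get(feat, 0.0) for feat, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_candidates(p_vectors, m_vectors):
    """Return (similarity, index) pairs over P_d followed by M_d, best first."""
    x_d = date_vector(p_vectors, m_vectors)
    candidates = p_vectors + m_vectors
    scored = [(cosine(vec, x_d), idx) for idx, vec in enumerate(candidates)]
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))


# Toy vectors: two sentences published around d, one sentence mentioning d.
P = [{"enron": 1.0, "bankruptcy": 2.0}, {"enron": 1.0, "shares": 1.0}]
M = [{"bankruptcy": 1.0, "filed": 1.0}]
x_d = date_vector(P, M)
ranking = rank_candidates(P, M)
```

In this toy example only "bankruptcy" occurs in both sets, so it is the only feature kept in the date vector, and sentences without that term receive a similarity of zero.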

Date Summaries
To construct the final timeline, we separately construct a summary for the l highest ranked dates. Prior to our main experiments, we test several multi-document summarization algorithms:
• TEXTRANK: Runs PageRank on a graph of pairwise sentence similarities to rank sentences (Mihalcea and Tarau, 2004).
• CENTROID-RANK: Ranks sentences by their similarity to the centroid of all sentences (Radev et al., 2004).
• CENTROID-OPT: Greedily optimizes a summary to be similar to the centroid of all sentences (Ghalandari, 2017).
• SUBMODULAR: Greedily optimizes a summary using submodular objective functions that represent coverage and diversity (Lin and Bilmes, 2011).
The only modification to these algorithms in our TLS pipeline is that we prevent sentences that do not contain any topic keyphrase from query Q from being included in the summary. CENTROID-OPT has the best results (Appendix A.1) and is used in the main experiments.
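To make the greedy centroid optimization and the keyphrase constraint concrete, here is a minimal sketch. It is an illustration based only on the one-line description above, not the original CENTROID-OPT implementation: it uses raw term counts instead of TF-IDF weights, whitespace tokenization, and case-insensitive substring matching for keyphrases, all of which are simplifying assumptions.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity of two bag-of-words Counters."""
    dot = sum(w * v.get(term, 0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def centroid_opt(sentences, keyphrases, k):
    """Greedily pick up to k sentences whose summed vector is closest to the centroid."""
    vectors = [Counter(s.lower().split()) for s in sentences]
    centroid = Counter()
    for vec in vectors:
        centroid.update(vec)
    # Keyphrase constraint: only sentences containing a query keyphrase may be
    # selected, but all sentences still contribute to the centroid.
    allowed = [
        i for i, s in enumerate(sentences)
        if any(q.lower() in s.lower() for q in keyphrases)
    ]
    chosen, summary_vec = [], Counter()
    while allowed and len(chosen) < k:
        best = max(allowed, key=lambda i: cosine(summary_vec + vectors[i], centroid))
        chosen.append(best)
        summary_vec = summary_vec + vectors[best]
        allowed.remove(best)
    return [sentences[i] for i in chosen]


sentences = [
    "Enron files for bankruptcy",
    "Enron shares fall after bankruptcy filing",
    "Markets closed for the holiday",
]
summary = centroid_opt(sentences, ["enron"], k=1)
```

If no sentence matches a keyphrase, the function returns an empty list, which corresponds to the case where a date cannot be summarized.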

Timeline Construction
The date-wise approach constructs a timeline as follows: first, rank all potential dates using one of the date selection approaches described, then pick the l highest ranked ones, pick candidate sentences for each date, and summarize each date individually from the corresponding candidate set, using k sentences. We might not be able to summarize a particular date due to the keyphrase constraint in the summarization step. Whenever this is the case, we skip to the next date in the ranked list, until l is reached.
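The construction loop, including the skipping of dates that cannot be summarized, can be sketched as below. The function name and the two callables (`candidates_for`, `summarize`) are placeholders assumed for this example; any date ranking, candidate selection, and summarization method with these shapes could be plugged in.

```python
def build_datewise_timeline(ranked_dates, candidates_for, summarize, l, k):
    """Walk down the ranked dates until l dates have a non-empty summary."""
    timeline = {}
    for d in ranked_dates:
        if len(timeline) == l:
            break
        summary = summarize(candidates_for(d), k)
        if summary:  # dates without any usable sentence are skipped
            timeline[d] = summary
    # Return the entries in chronological order.
    return dict(sorted(timeline.items()))


# Toy setup: ISO date strings sort chronologically.
ranked_dates = ["2001-12-02", "2001-11-08", "2001-10-16"]
candidates = {
    "2001-12-02": ["Enron files for bankruptcy"],
    "2001-11-08": ["Markets closed"],
    "2001-10-16": ["Enron reports a loss"],
}


def keyword_summarizer(sents, k):
    """Stand-in summarizer: keep the first k sentences that mention the keyphrase."""
    return [s for s in sents if "Enron" in s][:k]


timeline = build_datewise_timeline(
    ranked_dates, candidates.get, keyword_summarizer, l=2, k=1
)
```

Here the second-ranked date has no sentence matching the keyphrase, so the loop moves on to the third-ranked date to fill the timeline.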

Event Detection Approach
When humans are tasked with constructing a timeline, we expect that they reason over important events rather than dates. Conceptually, detecting and selecting events might also be more appropriate than selecting dates because multiple events can happen on the same day, and an event can potentially span multiple days.
To explore this, we test a TLS approach based on event detection by means of article clustering. The general approach can be summarized as follows: (1) Group articles into clusters; (2) Rank and select the l most important clusters; (3) Construct a summary for each cluster. Similarly to the date-wise approach, this mostly consists of existing building blocks that we adapt for TLS.

Clustering
For each input collection A, we compute sparse TF-IDF unigram bag-of-words vectors for all articles in A. We apply clustering algorithms to these vectors. To cluster articles, we use Markov Clustering (MCL) with a temporal constraint. MCL (Van Dongen, 2000) is a clustering algorithm for graphs, i.e., a community detection algorithm. It is based on simulating random walks along nodes in a graph. Ribeiro et al. (2017) use this approach for clustering news articles.
We convert A into a graph where nodes correspond to articles so that we can cluster the articles using MCL, with the following temporal constraint: articles $a_1$, $a_2$ are assigned an edge if their publication dates are at most 1 day apart from each other.
The edge weight is set to the cosine similarity between the TF-IDF bag-of-words vectors of $a_1$ and $a_2$. The constraint on the publication dates ensures that clusters do not have temporal gaps. Furthermore, it reduces the number of similarity computations between pairs of articles considerably. We run MCL on this graph and obtain clusters by identifying the connected components in the resulting connectivity matrix.
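The temporally constrained graph construction can be sketched as follows. The sketch only builds the weighted edge list; the MCL step itself is not shown. The input format (pairs of publication date and sparse TF-IDF vector) and the function names are assumptions made for illustration, and the brute-force loop over all pairs is kept for brevity rather than efficiency.

```python
import math
from datetime import date
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two sparse {feature: weight} vectors."""
    dot = sum(w * v.get(feat, 0.0) for feat, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def build_article_graph(articles, max_gap_days=1):
    """articles: list of (publication_date, sparse_vector) pairs.

    Returns {(i, j): weight} for all pairs published at most
    max_gap_days apart; similarities are computed only for those pairs.
    """
    edges = {}
    for (i, (d1, v1)), (j, (d2, v2)) in combinations(enumerate(articles), 2):
        if abs((d1 - d2).days) <= max_gap_days:
            edges[(i, j)] = cosine(v1, v2)
    return edges


articles = [
    (date(2001, 12, 2), {"enron": 1.0, "bankruptcy": 1.0}),
    (date(2001, 12, 3), {"enron": 1.0}),
    (date(2001, 12, 10), {"enron": 1.0, "bankruptcy": 1.0}),
]
edges = build_article_graph(articles)
```

The third article is textually identical to the first but receives no edge to it, which illustrates how the constraint keeps clusters free of temporal gaps.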

Assigning Dates to Clusters
We define the cluster date as the date that is most frequently mentioned within articles of the cluster. We identify date mentions using the HeidelTime tool.

Cluster Ranking
To construct a timeline, we only need the l most important clusters. We obtain these by ranking the clusters and retaining the top l of the ranked list. We test the following scores to rank clusters by:
• SIZE: Rank by the number of articles in a cluster.
• DATEMENTIONCOUNT: Rank by how often the cluster date is mentioned throughout the input collection.
• REGRESSION: Rank using a score by a regression model trained to predict importance scores of clusters.
For the regression-based ranking method, we represent clusters using the following features: number of articles in a cluster; number of days between the publication dates of the first and last article in the cluster; maximum count of publication dates of articles within a cluster; maximum mention count of dates mentioned in articles in a cluster; sum of mention counts of dates mentioned in articles in a cluster. We test two approaches to label clusters with target scores to predict.
• Date-Accuracy: This is 1 if the cluster date appears in the ground-truth, else 0.
• ROUGE: The ROUGE-1 F1-score between the summary of the cluster and the ground-truth summary of the cluster date. If the cluster date does not appear in the ground-truth, the score is set to 0.
We evaluate these different options (Appendix A.2) and observe that ranking by DATEMENTIONCOUNT works better than the supervised methods, showing that predicting the suitability of clusters for timelines is difficult.

Cluster Summarization
We use the same multi-document summarization method that works best for the date-wise approach (CENTROID-OPT).

Timeline Construction
In summary, the clustering approach builds a timeline as follows: 1) cluster all articles, 2) rank clusters, 3) build a summary with k sentences for the top-l clusters, skipping clusters if a summary cannot be constructed due to missing keywords. Furthermore, we skip clusters if the date assigned to the cluster is already "used" by a previously picked cluster. Conceptually, this implies that we can only recognize one event per day. In initial experiments, this leads to better results than alternatives, e.g., allowing multiple summaries of length k per day.
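The steps after clustering (assigning cluster dates, ranking by DATEMENTIONCOUNT, and skipping clusters whose date is already used) can be sketched as below. The cluster representation (a list of article dictionaries with `text` and `mentioned_dates` keys), the function names, and the stand-in summarizer are assumptions made for this example.

```python
from collections import Counter


def cluster_date(cluster):
    """Most frequently mentioned date within the articles of a cluster."""
    counts = Counter(d for article in cluster for d in article["mentioned_dates"])
    return counts.most_common(1)[0][0] if counts else None


def build_cluster_timeline(clusters, global_mentions, summarize, l, k):
    """Rank clusters by how often their date is mentioned in the whole collection."""
    dated = [(cluster_date(c), c) for c in clusters]
    dated = [(d, c) for d, c in dated if d is not None]
    dated.sort(key=lambda pair: -global_mentions.get(pair[0], 0))
    timeline = {}
    for d, cluster in dated:
        if len(timeline) == l:
            break
        if d in timeline:  # one event per day: this date is already used
            continue
        summary = summarize(cluster, k)
        if summary:  # skip clusters that cannot be summarized
            timeline[d] = summary
    return dict(sorted(timeline.items()))


clusters = [
    [{"text": "Enron files for bankruptcy", "mentioned_dates": ["2001-12-02"]}],
    [{"text": "Enron bankruptcy shocks markets",
      "mentioned_dates": ["2001-12-02", "2001-12-02"]}],
    [{"text": "Enron reports a loss", "mentioned_dates": ["2001-10-16"]}],
]
global_mentions = {"2001-12-02": 3, "2001-10-16": 1}


def first_k_texts(cluster, k):
    """Stand-in summarizer: the first k article texts of the cluster."""
    return [article["text"] for article in cluster][:k]


timeline = build_cluster_timeline(clusters, global_mentions, first_k_texts, l=2, k=1)
```

The second cluster shares its date with the first one and is therefore skipped, so the timeline is completed with the lower-ranked third cluster.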

Dataset
Tran et al. introduced the 17 Timelines (T17) (Tran et al., 2013a) and the CRISIS (Tran et al., 2015a) datasets for timeline summarization from news articles.However, we see the need for better benchmarks due to 1) a small number of topics in the T17 and CRISIS datasets (9 and 4 topics respectively), and 2) relatively short time span, ranging from a few months to 2 years.
Therefore, we build a new TLS dataset, called ENTITIES, that contains more topics (47) and longer time-ranges per topic, e.g., decades of news articles.In the following, we describe how we obtain ground-truth timelines and input article collections for this dataset.
Ground-Truth Timelines: We obtain ground-truth timelines from CNN Fast Facts, which has a collection of several hundred timelines grouped in categories, e.g., 'people' or 'disasters'. We pick all timelines of the 'people' category and a small number from other categories.
Queries: For each ground-truth timeline, we define a set of query keyphrases Q. By default, we use the original title of the timeline as the keyphrase. For people entities, we use the last token of the title to capture surnames only, which increases the coverage. We manually inspect the resulting sets of keyphrases and correct these if necessary.
Input Articles: For each entity from the ground-truth timelines, we search for news articles using TheGuardian API. We use this source because it provides access to all published articles starting from 1999. We search for articles that have exact matches of the queries in the article body. The timespan for the article search is set so that it extends the ground-truth timeline by 10% of its days before its first and after its last date.
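As a small worked example of the search window, the following sketch extends a ground-truth timeline by 10% of its length on both sides. Rounding the padding to whole days and the function name are assumptions; the text above does not specify how fractional days are handled.

```python
from datetime import date, timedelta


def search_span(first, last, margin=0.10):
    """Extend [first, last] by a fraction of its length in days on both sides."""
    span_days = (last - first).days
    pad = timedelta(days=round(span_days * margin))  # rounding is an assumption
    return first - pad, last + pad


# A timeline covering 100 days is padded by 10 days on each side.
start, end = search_span(date(2000, 1, 1), date(2000, 4, 10))
```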
Adjustments and Filtering: The ground-truth timelines are modified to be usable for TLS and to ensure they do not contain data not present in the document collection:
• We remove entries in the ground-truth timelines if they do not specify year, month, and day of an event.
• Ground-truth timelines are truncated to the first and last date of the input articles.
• Entries in the ground-truth timeline are removed if there is no input article published ± 2 days.
Afterwards, we remove all topics from the dataset that do not fulfill the following criteria:
• The timeline must have at least 5 entries.
• For at least 50% of the dates present in the ground-truth timeline, textual references have to be found in the article collection (e.g., 'on Wednesday' or 'on 1 August'). This is done to ensure that the content of the timelines is reflected to some degree in the article collection.
• There are at least 100 and less than 3000 articles containing the timeline-entity in the input articles.This is done to reduce the running time of experiments.
Dataset Characteristics: Tables 2 and 3 give an overview of properties of the two existing datasets and our new dataset, and mostly show averaged values over tasks in a dataset. An individual task corresponds to one ground-truth timeline that a TLS algorithm aims to simulate. #PubDates refers to the number of days in an article collection A on which any articles are published. The compression ratio w.r.t. sentences ("comp. ratio (sents)") is m divided by the total number of sentences in A, and the compression ratio w.r.t. dates is l divided by #PubDates. "Avg. date cov" refers to the average coverage of dates in the ground-truth timeline r by the articles in A. This can be counted by using publication dates in A ("published"), or by textual references to dates within articles in A ("mentioned"). The fact that textual date references generally cover more ground-truth dates than publication dates do suggests making use of these date mentions.
T17 has longer (l), and more detailed (k) timelines than the other datasets, CRISIS has more articles per task, and ENTITIES has more topics, publication dates and longer time periods per task.
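The two compression ratios defined above reduce to simple quotients. The sketch below computes them for a single task; the numbers in the example are hypothetical and are not taken from Tables 2 and 3.

```python
import math


def compression_ratios(l, k, num_sentences, num_pub_dates):
    """Return (ratio w.r.t. sentences, ratio w.r.t. dates) for one task."""
    m = l * k  # total number of summary sentences in the timeline
    return m / num_sentences, l / num_pub_dates


# Hypothetical task: 20 dates with 2 sentences each, drawn from a collection
# with 10,000 sentences published on 400 distinct days.
sent_ratio, date_ratio = compression_ratios(
    l=20, k=2, num_sentences=10_000, num_pub_dates=400
)
```

Lower ratios mean that a smaller fraction of the input has to be selected for the timeline.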

Evaluation Metrics
In our experiments, we measure the quality of generated timelines with the following two evaluation metrics, which are also used by Martschat and Markert (2018):
• Alignment-based ROUGE F1-score: This metric compares the textual overlap between a system and a ground-truth timeline, while also considering the assignments of dates to texts.
• Date F1-score: This metric compares only the dates of a system and a ground-truth timeline.
We denote the alignment-based ROUGE-1 F1-score as AR1-F and Date F1-score as Date-F1.
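Date-F1 only compares sets of dates, so it can be illustrated in a few lines. The sketch below is a plain set-based F1 with a hypothetical function name; it is meant to show what the metric measures and is not the evaluation tool used in the experiments. The alignment-based ROUGE variant is more involved and is not shown.

```python
def date_f1(system_dates, reference_dates):
    """F1-score between the dates of a system and a ground-truth timeline."""
    system, reference = set(system_dates), set(reference_dates)
    hits = len(system & reference)
    if hits == 0:
        return 0.0
    precision = hits / len(system)
    recall = hits / len(reference)
    return 2 * precision * recall / (precision + recall)


# Two of four system dates are correct; the reference has three dates.
system = ["2001-10-16", "2001-11-08", "2001-12-02", "2001-12-03"]
reference = ["2001-10-16", "2001-12-02", "2002-01-10"]
score = date_f1(system, reference)
```

With a precision of 1/2 and a recall of 2/3, the score in this example is 4/7.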

Experimental Settings
Concerning the datasets and task, we follow the experimental settings of Martschat and Markert (2018):
• Each dataset is divided into multiple topics, each having at least one ground-truth timeline. If a topic has multiple ground-truth timelines, we split the topic into multiple tasks. The final results in the evaluation are based on averages over tasks/ground-truth timelines, not over topics.
• Each task includes a set of news articles A, a set of keyphrases Q, a ground-truth timeline r, with number of dates (length) l, average number of summary sentences per date k, and total number of summary sentences m = l * k.
• In each task, we remove all articles from A whose publication dates are outside of the range of dates of the ground-truth timeline r of the task. Article headlines are not used.
• We run leave-one-out cross-validation over all tasks of a dataset.
• We test for significant differences using an approximate randomization test (Marcus et al., 1993) with a p-value of 0.05.
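For readers unfamiliar with approximate randomization, the following is a minimal sketch of a paired variant of the test on per-task scores. The number of trials, the use of the summed score difference as test statistic, and the add-one smoothing of the p-value are assumptions of this sketch; the exact configuration used in the experiments is not specified above.

```python
import random


def approximate_randomization(scores_a, scores_b, trials=2000, seed=0):
    """Estimate a two-sided p-value for the difference between paired scores."""
    rng = random.Random(seed)
    observed = abs(sum(a - b for a, b in zip(scores_a, scores_b)))
    at_least_as_extreme = 0
    for _ in range(trials):
        diff = 0.0
        for a, b in zip(scores_a, scores_b):
            # Randomly swap the two systems' scores for this task.
            diff += (a - b) if rng.random() < 0.5 else (b - a)
        if abs(diff) >= observed:
            at_least_as_extreme += 1
    # Add-one smoothing keeps the estimate strictly positive.
    return (at_least_as_extreme + 1) / (trials + 1)


# Hypothetical per-task AR1-F scores of two systems on 12 tasks.
p_different = approximate_randomization([0.6] * 12, [0.4] * 12)
p_identical = approximate_randomization([0.5] * 12, [0.5] * 12)
```

A difference would be reported as significant when the estimated p-value falls below 0.05.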
We use the following configurations for our methods:
• A stricter and simpler version of the output size constraint: We produce timelines with the number of dates l and k sentences per date.
• In the summarization step of our methods, we only allow a sentence to be part of a summary if it contains any keyphrase in Q. As opposed to Martschat and Markert (2018), we still keep sentences not matching Q, e.g., for TF-IDF computation, clustering, and computing date vectors.

Methods Evaluated
We compare the following types of methods to address the full news TLS task.

Direct summarization approaches:
• CHIEU2004 (Chieu and Lee, 2004): An unsupervised baseline based on direct summarization. We use the reimplementation from Martschat and Markert (2018).
• MARTSCHAT2018 (Martschat and Markert, 2018): State-of-the-art method on the CRISIS and T17 datasets. It greedily selects a combination of sentences from the entire collection A maximizing submodular functions for content coverage, textual and temporal diversity, and a high count of date references.

Date-wise approaches:
• TRAN2013 (Tran et al., 2013a): The original date-wise approach, using regression for both date selection and summarization, and using all sentences of a date as candidate sentences.
• PUBCOUNT: A simple date-wise baseline that uses the publication count to rank dates, and all sentences published on a date for candidate selection.We use CENTROID-OPT for summarization.
• DATEWISE: Our date-wise approach after testing different building blocks (see Appendix A.1).It uses supervised date selection, PM-MEAN for candidate selection and CENTROID-OPT for summarization.
Event detection approach based on clustering:
• CLUST: We use DATEMENTIONCOUNT to rank clusters, and CENTROID-OPT for summarization, which are the best options according to our tests (see Appendix A.2).
Note that all methods apart from DATEWISE and CLUST have been proposed previously.

Oracles:
To interpret the alignment-based ROUGE scores better and to approximate their upper bounds, we measure the performance of three different oracle methods:
• DATE ORACLE: Selects the correct (ground-truth) dates and uses CENTROID-OPT for date summarization.
• TEXT ORACLE: Uses regression to select dates, and then constructs a summary for each date by optimizing the ROUGE to the ground-truth summaries.
• FULL ORACLE: Selects the correct dates and constructs a summary for each date by optimizing the ROUGE to the ground-truth summaries.
We give more detail about these in Appendix A.3.

Results
Table 4 shows the final evaluation results. We reproduced the results of CHIEU2004 and MARTSCHAT2018 reported by Martschat and Markert (2018) using their provided code (see footnote 9). The other results are based on our implementations. Table 10 in Appendix A.6 shows several output examples across different methods.

Analysis and Discussion

Performance of TLS Strategies
Among the methods evaluated, DATEWISE consistently outperforms all other methods on all tested datasets in the alignment-based ROUGE metrics. The Date-F1 metric for this method is close to other methods, and not always better, which shows that the advantage of DATEWISE is due to the sentence selection (based on our heuristic date vectors) and summarization. Note that the date selection method is identical to TRAN2013. We conclude from these results that the expensive combinatorial optimization used in MARTSCHAT2018 is not necessary to achieve high accuracy for news TLS. CLUST performs worse than DATEWISE and MARTSCHAT2018, except on ENTITIES, where it outperforms MARTSCHAT2018. We find that for the other two datasets, CLUST often merges articles from close dates together that would belong to separate events on ground-truth timelines, which may suggest that a different granularity of clusters is required depending on the task.
DATE ORACLE and FULL ORACLE should theoretically have a 100% Date-F1. In practice, their Date-F1 scores turn out lower because, for some dates, no candidate sentences that match query Q can be found, which causes the dates to be omitted from the oracle timelines.
Based on the performance of different systems, the hardest dataset is ENTITIES, followed by CRISIS.
Footnote 9: With the exception of CRISIS due to memory issues.

What makes TLS difficult?
While the ranking of methods is fairly stable, the performance of all methods varies a lot across the datasets and across individual tasks within datasets.
To find out what makes individual tasks difficult, we measure the Spearman correlation between AR1-F and several dataset statistics. The details are included in Appendix A.5. The correlations show that a high number of articles and publication dates and a low compression ratio w.r.t. dates generally decrease performance. This implies that highly popular topics are harder to summarize. A longer topic duration also corresponds to lower performance, but in a less consistent pattern.
The generally low performance across tasks and methods is likely influenced by the following factors:
• The decision for human editors to include particular events in a timeline and to summarize these in a particular way can be highly subjective. Due to the two-stage nature of TLS, this problem is amplified in comparison to regular text summarization.
• Article collections can be insufficient to cover every important event of a topic, e.g., due to the specific set of news sources or the search technique used.

Running Time
DATEWISE and CLUST are up to an order of magnitude faster to run than MARTSCHAT2018 (Appendix A.4) since their date summarization steps only involve a small subset of sentences in an article collection.

Adjacent Dates and Redundancy
Automatically constructed timelines often contain many adjacent dates, while this is not the case in ground-truth timelines. Summaries of such adjacent dates often tend to refer to the same event and introduce redundancy into a timeline. To quantify this, we count the proportion of "date bigrams" in a chronologically ordered timeline that are only 1 day apart.
The results (see Table 5) show that this is an issue for MARTSCHAT2018 and DATEWISE, but less so for CLUST, which is designed to avoid this behavior. Note that MARTSCHAT2018 includes an objective function to reward diversity within a timeline, while DATEWISE has no explicit mechanism against redundancy among separate dates. Interestingly, when forcing DATEWISE to avoid selecting adjacent dates (by skipping such dates in the ranked list), the performance in all metrics decreases. In this case, high redundancy is a safer strategy for optimizing TLS metrics compared to enforcing a more balanced spread over time. Because of such effects, we advise using automated evaluation metrics for TLS with care and conducting qualitative analysis and user studies where possible.
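The adjacency measure used in this analysis, the proportion of date bigrams that are exactly one day apart, can be computed as in the following sketch. The function name and the handling of timelines with fewer than two dates (returning 0.0) are assumptions made for this example.

```python
from datetime import date


def adjacent_date_proportion(dates):
    """Share of consecutive date pairs in a timeline that are 1 day apart."""
    ordered = sorted(set(dates))
    bigrams = list(zip(ordered, ordered[1:]))
    if not bigrams:
        return 0.0
    adjacent = sum((later - earlier).days == 1 for earlier, later in bigrams)
    return adjacent / len(bigrams)


# Three date bigrams, two of which are adjacent days.
timeline_dates = [
    date(2001, 12, 1),
    date(2001, 12, 2),
    date(2001, 12, 3),
    date(2001, 12, 10),
]
proportion = adjacent_date_proportion(timeline_dates)
```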

Use of Titles
While using article titles can make timelines more readable and understandable (Tran et al., 2015a), we do not involve titles in our main experiments, in order to directly compare to MARTSCHAT2018, and due to the lack of titles in T17. The last row in Table 4 shows the results of a separate experiment with DATEWISE in which we build date summaries using titles only. Using only titles generally increases AR Precision at the cost of Recall. AR-F is negatively affected in CRISIS but does not change in ENTITIES. Table 1 shows parts of a title-based timeline produced by DATEWISE.

Conclusion
In this study, we have compared and proposed different strategies to construct timeline summaries of long-ranging news topics: the previous state-of-the-art method based on direct summarization, a date-wise approach, and a clustering-based approach. By exploiting temporal expressions, we have improved the date-wise approach and obtained new state-of-the-art results on all tested datasets. Hence, we showed that an expensive combinatorial search over all sentences in a document collection is not necessary to achieve good results for news TLS. For a more robust and diverse evaluation, we have constructed a new TLS dataset with a much larger number of topics and with longer time-spans than in previous datasets. Most of the generated timelines are still far from oracle timeline extractors and leave large gaps for improvements. Potential future directions include a more principled use of our proposed heuristic for detecting content relevant to specific dates, the use of abstractive techniques, a more effective treatment of the redundancy challenge, and extending the new dataset with multiple sources.

A.4 Running Time
In Table 9 we compare the running time of DATEWISE and MARTSCHAT2018 on the T17 and ENTITIES datasets (see footnote 10). The implementations of both our methods and of MARTSCHAT2018 make use of parallel computation to obtain pairwise similarities between sentences or documents where required. We do not parallelize our methods in any other way. We could not run MARTSCHAT2018 on the CRISIS dataset since it requires too much memory, which demonstrates the need for more scalable state-of-the-art methods. DATEWISE and CLUST are considerably faster on both datasets, due to their "divide-and-conquer" nature: the summarization step is applied to only l smaller portions of articles and sentences, instead of the entire set. Note that part of the time is required to run the evaluation tool to compute alignment-based ROUGE.

A.5 Correlations between Performance and Dataset Characteristics
Detailed results of correlations between different methods and different dataset characteristics are shown in Table 8.
Footnote 10: On a machine with 16 3.70 GHz Intel CPUs and 32 GB memory.

A.6 Output Examples
Table 10 shows parts of timelines produced by different methods for a selection of dates that all methods have selected. The topics are taken from the ENTITIES dataset. The examples demonstrate different levels of detail in describing particular events.

Figure 1: Counts of published articles and textual mentions across dates in an article collection about Enron.

Table 2: Dataset Statistics for the TLS task (i)

Table 3: Dataset Statistics for the TLS task (ii)

Table 4: Results on the full TLS task. Symbols indicate a significant improvement over TRAN2013, over CLUST (•), and over MARTSCHAT2018 (†). DATEWISE (titles) is not included in the significance testing.

Table 5: Proportion of adjacent dates of timelines produced by different methods, and the ground-truth timelines.

Table 8: Correlations between Task Properties and Method Performance.

Table 9: Running time comparison between current state-of-the-art method MARTSCHAT2018 and the methods we implemented.