Towards Automatic Construction of News Overview Articles by News Synthesis

In this paper we investigate a new task of automatically constructing an overview article from a given set of news articles about a news event. We propose a news synthesis approach to address this task based on passage segmentation, ranking, selection and merging. Our proposed approach is compared with several typical multi-document summarization methods on the Wikinews dataset, and achieves the best performance on both automatic evaluation and manual evaluation.


Introduction
There are usually many news articles about a news event, and news summaries help readers quickly learn the most salient information in these articles. News summaries in previous studies are usually very short, typically one or two hundred words. However, in many circumstances readers want to learn more about an event: a short news summary is insufficient, but people are also reluctant to read every news article one by one. A possible solution to this problem is to construct a long and comprehensive news overview article that summarizes and presents all important facts about the news event in an unbiased way. A news overview article can be considered a long summary; however, it is more comprehensive, and its text is harder to arrange and organize.
In this paper, we conduct a pilot study of the new task of automatically constructing a news overview article from a set of news articles about an event. Traditional multi-document summarization methods can be applied to this task, but they do not perform well, because the sentence-based extraction used in these methods is not suitable for constructing and organizing a long article. Instead, we propose a news synthesis approach that uses the passage as the basic unit. In this study, a passage is not a natural paragraph, but a block of text (possibly several paragraphs) about one subtopic of an event. Our approach first segments the news articles into passages with the SenTiling algorithm, then ranks the passages with the DivRank algorithm, and finally selects and merges a few passages to construct the long news overview article.
We automatically build an evaluation dataset based on English Wikinews. Most Wikinews articles are synthesis articles written using information from other online news sources: all the important facts available from all sources about a news event are combined into a single article for the reader's convenience, and the information is presented in a neutral manner that avoids the bias that may be present in individual news sources. Therefore, we treat a Wikinews article as an ideal overview article (i.e., reference) for its source news articles.
We compare our proposed approach with several typical multi-document summarization methods based on the Wikinews dataset. The results are very promising and our approach achieves the best performance on both automatic evaluation and manual evaluation. In this study, we demonstrate the feasibility of automatic construction of long overview articles from a set of news articles.
The contributions of this paper are summarized as follows: 1) we are the first to investigate the task of automatic construction of news overview articles from a set of source news articles; 2) we automatically build an evaluation dataset based on Wikinews; 3) we propose a news passage-based synthesis approach to address this task; 4) evaluation results verify the efficacy of our approach.

Our News Synthesis Approach
We propose a news synthesis approach to the automatic construction of news overview articles from a set of source news articles. Our approach uses the passage as the basic unit and consists of three main steps: passage segmentation, passage ranking, and passage selection and merging. The rationale for using passages rather than sentences is that 1) the sentences within a passage are more complete and coherent than sentences selected from different places in different documents; 2) it is easier to arrange several passages than to arrange a large number of sentences.

Passage Segmentation
In this step, we aim to segment each source news article into several passages, where each passage represents a subtopic of the event. To this end, we adapt the TextTiling algorithm (Hearst, 1997), a popular algorithm for discovering subtopic structure using term repetition. The original TextTiling algorithm often splits a sentence across two passages; to remedy this problem, we slightly modify TextTiling, and our new SenTiling algorithm consists of three steps.
Tokenization divides the input text into individual lexical units; the tokens are converted to lower case and stemmed with the Porter stemmer.
Lexical score determination assigns a lexical score to each gap between text blocks. To avoid incomplete sentences in the segmentation result, we regard each sentence as a text block and compute the lexical score of the gap at the end of each sentence as the cosine similarity between the 100 words before and the 100 words after the gap. We do not use natural paragraphs as blocks because their lengths are highly irregular.
Boundary identification assigns a depth score to each sentence gap and then determines the passage boundaries of a document. The depth score is computed in the same way as in (Hearst, 1997): it reflects how strongly the cues for a subtopic change on both sides of a given gap, and is based on the distances from the peaks on both sides of a valley to that valley. Since every gap is a potential segment boundary, we select a gap as a boundary only if its depth score exceeds s − σ, where s is the average of the depth scores and σ is their standard deviation (thus assuming that the scores are normally distributed).
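The segmentation procedure above can be sketched in code. This is a minimal, illustrative implementation: the function names (`cosine`, `sentiling`) and the input format (a list of tokenized sentences) are our own assumptions, and details such as stemming and tokenization are omitted.

```python
import math
from statistics import mean, stdev

def cosine(a, b):
    """Cosine similarity between two bags of words (token lists)."""
    va, vb = {}, {}
    for t in a: va[t] = va.get(t, 0) + 1
    for t in b: vb[t] = vb.get(t, 0) + 1
    dot = sum(v * vb.get(t, 0) for t, v in va.items())
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def sentiling(sentences, window=100):
    """SenTiling sketch: segment a list of tokenized sentences at
    sentence gaps. Returns indices i such that a passage boundary
    falls after sentences[i]."""
    tokens = [t for s in sentences for t in s]
    # cumulative token position at the end of each sentence
    ends, pos = [], 0
    for s in sentences:
        pos += len(s)
        ends.append(pos)
    gaps = range(len(sentences) - 1)
    # lexical score at each gap: cosine similarity of the `window`
    # tokens before vs. after the gap
    lex = [cosine(tokens[max(0, ends[i] - window):ends[i]],
                  tokens[ends[i]:ends[i] + window]) for i in gaps]
    # depth score: distance from the peaks on both sides of the valley
    depth = []
    for i in gaps:
        l = i
        while l > 0 and lex[l - 1] >= lex[l]: l -= 1
        r = i
        while r < len(lex) - 1 and lex[r + 1] >= lex[r]: r += 1
        depth.append((lex[l] - lex[i]) + (lex[r] - lex[i]))
    cutoff = mean(depth) - stdev(depth)   # the s - sigma threshold
    return [i for i in gaps if depth[i] > cutoff]
```

On a toy document with two lexically distinct subtopics, the deepest valley falls at the topic shift, so the gap between the two subtopics is selected as a boundary.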

Passage Ranking
We use DivRank (Mei et al., 2010) to rank passages, because DivRank automatically balances the prestige and the diversity of the top-ranked passages in a principled way. It is based on a time-variant random walk process known as the vertex-reinforced random walk. Let p_T(v) be the probability that the walk is at state v at time T, and p_T(u, v) be the transition probability from any state u to any state v at time T:
p_T(u, v) = (1 − λ) · p*(v) + λ · (p_0(u, v) · N_T(v)) / D_T(u), where D_T(u) = Σ_v p_0(u, v) · N_T(v),
N_T(v) is the number of times the walk has visited v up to time T, p*(v) is a prior preference distribution over the states, and p_0(u, v) is the organic transition probability prior to any reinforcement, which is estimated as in a regular time-homogeneous random walk by the normalized cosine similarity value between u and v.
After a sufficiently large T , the reinforced random walk will converge to a stationary distribution, and each passage node will be assigned with a rank score.
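A compact sketch of this ranking step follows. It uses the common cumulative-visit approximation in which N_T(v) is replaced by the current visit-probability estimate; the function name `divrank` and the uniform prior are our own assumptions, not details from the paper.

```python
import numpy as np

def divrank(W, lam=0.85, iters=200):
    """DivRank sketch via a vertex-reinforced random walk.

    W: nonnegative passage-passage similarity matrix (rows sum > 0).
    Returns a rank score per passage (a probability distribution).
    """
    n = W.shape[0]
    p0 = W / W.sum(axis=1, keepdims=True)   # organic transition probs
    pstar = np.full(n, 1.0 / n)             # uniform prior preference p*
    p = np.full(n, 1.0 / n)                 # current visit distribution
    for _ in range(iters):
        # reinforced transitions: p0(u,v) weighted by visits to v,
        # then row-normalized (the D_T(u) factor)
        T = p0 * p[None, :]
        T = T / T.sum(axis=1, keepdims=True)
        p = (1 - lam) * pstar + lam * (p @ T)
    return p
```

The fixed-point iteration plays the role of running the walk for a sufficiently large T; the returned vector is the stationary distribution used as the rank scores.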

Passage Selection and Merging
We aim to select several important but non-redundant passages to form the overview article. The selection can be done according to the DivRank scores, because the scores balance the prestige and the diversity of the top-ranked passages, but it occasionally happens that two closely related passages both get high scores. To remedy this problem and make the content for each subtopic more comprehensive and complete, we further merge related passages by adding informative sentences from a related passage into the selected passage. The greedy selection process is illustrated in Algorithm 1.
The function merge(g_i*, g_j*) merges the sentences of g_i* into g_j* one by one. If the average similarity between a sentence s_{i*,k} in g_i* and the sentences in g_j* is less than ξ, we insert s_{i*,k} into g_j*, choosing the insertion position between the two adjacent sentences s_{j*,m} and s_{j*,n} in g_j* for which the average of the similarity between s_{i*,k} and s_{j*,m} and the similarity between s_{i*,k} and s_{j*,n} is the largest.

Algorithm 1 Passage Selection and Merging
Input: Passage set G = {g_1, ..., g_n}, where each passage g_i is assigned a DivRank score p(g_i); the cosine similarity value gSim_{i,j} between any two passages g_i and g_j
Output: The passage set O in the overview article
1: Initialize O = ∅
2: while G ≠ ∅ and O does not reach the length limit do
3:   g_j* = arg max_{g ∈ G} p(g); move g_j* from G to O
4:   g_i* = arg max_{g ∈ G} gSim_{i,j*}
5:   if gSim_{i*,j*} > τ then
6:     g_j* = merge(g_i*, g_j*); remove g_i* from G
7:   end if
8: end while
9: return O
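The selection-and-merging step of Algorithm 1 can be sketched as follows. This is a simplified illustration under our own assumptions: passages are lists of sentence strings, `psim` is a precomputed passage-similarity matrix, and the sentence similarity `_cos` is a plain bag-of-words cosine; the names `merge` and `select_and_merge` are ours.

```python
import math

def _cos(a, b):
    """Cosine similarity between two whitespace-tokenized sentences."""
    ca, cb = {}, {}
    for t in a.lower().split(): ca[t] = ca.get(t, 0) + 1
    for t in b.lower().split(): cb[t] = cb.get(t, 0) + 1
    dot = sum(v * cb.get(t, 0) for t, v in ca.items())
    n = (math.sqrt(sum(v * v for v in ca.values()))
         * math.sqrt(sum(v * v for v in cb.values())))
    return dot / n if n else 0.0

def merge(src, dst, xi):
    """Insert each sentence of src whose average similarity to dst's
    sentences is below xi, at the gap where the mean similarity to the
    two neighbouring sentences is largest."""
    dst = list(dst)
    for s in src:
        avg = sum(_cos(s, d) for d in dst) / len(dst)
        if avg < xi:
            if len(dst) < 2:
                dst.append(s)
                continue
            best = max(range(1, len(dst)),
                       key=lambda m: (_cos(s, dst[m - 1]) + _cos(s, dst[m])) / 2)
            dst.insert(best, s)
    return dst

def select_and_merge(passages, scores, psim, tau=0.4, xi=0.5, max_words=600):
    """Greedy selection sketch: repeatedly take the top-ranked passage,
    and merge in the most similar remaining passage if it is related."""
    remaining = set(range(len(passages)))
    overview, words = [], 0
    while remaining and words < max_words:
        j = max(remaining, key=lambda k: scores[k])       # highest DivRank score
        remaining.discard(j)
        chosen = list(passages[j])
        if remaining:
            i = max(remaining, key=lambda k: psim[k][j])  # most similar passage
            if psim[i][j] > tau:
                chosen = merge(passages[i], chosen, xi)
                remaining.discard(i)
        overview.append(chosen)
        words += sum(len(s.split()) for s in chosen)
    return overview
```

With the default thresholds, a related passage whose similarity exceeds τ is folded into the selected passage rather than appearing as a separate (redundant) block.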
Finally, we arrange the passages in O with topological sorting to form the overview article. We follow two principles: 1) if passages u and v are from the same news article and u precedes v, they should be adjacent and keep the same order in the overview article; 2) if passages u and v are from different news articles and u has a higher DivRank score than v, then u and the passages from the same news article as u should be placed before v in the overview article.
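One simple realization of these two ordering principles is to group the selected passages by source article, keep the in-article order within each group, and order the groups by their best DivRank score. The function name `arrange` and the tuple-based input format are our own assumptions for illustration.

```python
def arrange(selected):
    """Order selected passages under the two principles.

    selected: list of (article_id, position_in_article, divrank_score, text).
    Passages from the same article stay adjacent, in original order;
    article groups are ordered by their best DivRank score, descending.
    """
    groups = {}
    for art, pos, score, text in selected:
        groups.setdefault(art, []).append((pos, score, text))
    # Principle 2: groups led by a higher-scored passage come first
    ordered_groups = sorted(groups.values(),
                            key=lambda g: max(s for _, s, _ in g),
                            reverse=True)
    result = []
    for g in ordered_groups:
        for _, _, text in sorted(g):   # Principle 1: in-article order
            result.append(text)
    return result
```

For example, if article "b" contributes the top-ranked passage, all of "b"'s passages precede those of article "a", while "a"'s passages keep their original relative order.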

Evaluation Dataset and Baselines
As mentioned in the introduction, we used Wikinews to construct the evaluation dataset. We first crawled 18,121 English Wikinews articles and their source news articles via the associated URLs. However, many Wikinews articles have very few source news articles and are very short, and moreover, the URLs of many source news articles are out of date. We filtered out the Wikinews articles with fewer than 5 available source news articles. Finally, we selected the 100 longest Wikinews articles from the remaining set for testing (the dataset will be released soon). The average number of words of the Wikinews articles in the test set is 598, and the average total number of words of their source news articles is 2136. Accordingly, the length limit of the overview articles produced by the different methods is set to 600 words.
For our approach, τ is set to 0.4 and ξ is set to 0.5 based on a small development set chosen from the remaining Wikinews articles. λ in the DivRank algorithm is set to 0.85 by default. Under the control of these thresholds, we merge only a very small number of passages and insert very few sentences from one passage into another, so the influence of passage merging on coherence is very subtle.

Evaluation Results and Analysis
Automatic Evaluation: Similar to traditional summarization tasks, we use the ROUGE metrics (Lin and Hovy, 2003) to automatically evaluate the quality of peer overview articles against the gold-standard references. We use ROUGE-1.5.5 and report the F-scores of ROUGE-1 (R-1), ROUGE-2 (R-2) and ROUGE-SU4 (R-SU4).
Firstly, we perform the evaluation on the whole articles; Table 1 shows the comparison results. Our approach outperforms all the baseline methods with respect to ROUGE-2 and ROUGE-SU4. The Submodular method achieves the highest ROUGE-1 score, but our approach achieves a very close ROUGE-1 score. To further evaluate the content organization in long articles, we split each article (both peer article and reference article) into two parts of equal length, compare the first parts of the peer and reference articles, then compare the second parts, and average the ROUGE scores across the two parts. Table 2 shows the comparison results under this evaluation protocol (two-part evaluation I). Furthermore, we allow the first part of a reference article to match the second part of a peer article, and vice versa: we allow only one-to-one matching and find the optimal matching between the two sets of parts, i.e., the matching with the largest sum of the similarity values of the matched parts. We then compute and average the ROUGE scores of the matched parts. Table 3 shows the comparison results under this evaluation protocol (two-part evaluation II). We can see from Tables 2 and 3 that our proposed approach performs much better than the baseline methods on all three metrics.
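The two-part protocols can be made concrete with a short sketch. With only two parts per article, the optimal one-to-one matching reduces to comparing the straight pairing with the crossed pairing and keeping the better one. The function names and the generic `score` callback (standing in for ROUGE) are our own assumptions.

```python
def split_halves(words):
    """Split a token list into two equal-length parts."""
    mid = len(words) // 2
    return [words[:mid], words[mid:]]

def two_part_eval_ii(ref_words, peer_words, score):
    """Two-part evaluation II sketch: allow cross matching between the
    halves, keep the one-to-one matching with the larger total score,
    then average the part-level scores."""
    r, p = split_halves(ref_words), split_halves(peer_words)
    straight = [score(r[0], p[0]), score(r[1], p[1])]
    crossed = [score(r[0], p[1]), score(r[1], p[0])]
    best = straight if sum(straight) >= sum(crossed) else crossed
    return sum(best) / 2
```

Two-part evaluation I is the same computation restricted to the straight pairing; evaluation II rewards a peer article that covers the right content even if its two halves are in the opposite order.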
Manual Evaluation: We randomly select 30 test cases for manual evaluation. We employ three students as human judges; each judge is asked to read the reference Wikinews article and the peer overview article produced by each method, and then give a rating score between 1 and 5 with respect to three aspects: content coverage, readability and overall responsiveness, where 5 means "very good", 3 means "acceptable", and 1 means "very bad". The judges are blind to the methods producing the articles. Finally, the rating scores for each aspect are averaged across the test cases and then across the three judges. Table 4 shows the manual evaluation results. Our proposed approach produces news overview articles with better content coverage, readability and overall responsiveness than the baseline methods, and the quality of the news overview articles is generally acceptable to the human judges. In all, our proposed approach is more effective than typical multi-document summarization methods for this challenging task, and it is feasible to automatically construct news overview articles by news synthesis.
Other related work includes the automatic generation of well-structured Wikipedia articles (Sauper and Barzilay, 2009; Yao et al., 2011). Different from Wikinews, Wikipedia articles usually have domain-dependent templates for content filling and organization.

Conclusion
In this pilot study, we proposed a news synthesis approach to the challenging task of automatically generating news overview articles. Evaluation results on Wikinews verified the efficacy and feasibility of the proposed approach. In future work, we will investigate supervised learning methods for passage ranking and selection, and try to paraphrase the selected passages.