KWB: An Automated Quick News System for Chinese Readers

We present an automated quick news system called KWB. KWB crawls and collects around the clock news items from over 120 news websites in mainland China, eliminates duplicates, and retrieves a summary of up to 600 characters for each news article using a proprietary summary engine. It then uses a Labeled-LDA classifier to classify the remaining news items into 19 categories, computes popularity ranks called PopuRank of the newly collected news items in each category, and displays the summaries of news items in each category sorted according to PopuRank together with a picture, if there is any, on http://www.kuaiwenbao.com and mobile apps. We will describe in this paper the system architecture of KWB, the data crawler structure, the functionalities of the central database, and the definition of PopuRank. We will show, through experiments, the running time of obtaining PopuRank. We will also demonstrate the use of KWB.


Introduction
We are living in the era of information explosion. To help people obtain information quickly, we would want to construct an automated system that collects information and provides accurate summarization to the user in a timely fashion. This would be a system that integrates advanced technologies and current research results on text automation, including data collection, storage, classification, ranking, summarization, web displaying, and app development. KWB is such a system that collects news items from the Internet and provides to the reader the summarization and PopuRank of each news item, making it easier for people to obtain critical information quickly.

* This work was supported in part by a grant from Wantology. Correspondence: wang@cs.uml.edu.
In this paper we will describe the data collection, data storage, and popularity ranking of news items for KWB. Descriptions of the other components, including the Labeled-LDA classifier and content extraction, will be reported in separate papers. KWB uses a proprietary summary engine to retrieve a summary of up to 600 characters for each news item.
This paper is organized as follows. In Section 2 we will describe related work. We will describe the architecture of KWB in Section 3, the KWB crawler framework for collecting news items in Section 4, and the KWB central database in Section 5. We will present the PopuRank formula in Section 6. In Section 7 we will describe the web displays of KWB, and we will conclude the paper in Section 8.

Related Work

Web crawling
Web-crawling technologies are important mechanisms for collecting data from the Internet (see, e.g., (Emamdadi et al., 2014; Lin and Bilmes, 2011; Li et al., 2011; Li and Teng, 2010; Zheng et al., 2008)). The general framework of a crawler is given below:
1. Provide the crawler with a seed URL.
2. The crawler grabs and stores the target page's content.
3. Enter the URLs contained in the target page into a waiting queue.
4. Process one URL at a time in the queue.
Repeat Steps 2 to 4.
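This loop can be sketched in a few lines. The sketch below uses an in-memory link map standing in for real HTTP fetches, and the page names are hypothetical; a visited set prevents refetching a URL.

```java
import java.util.*;

// Minimal breadth-first crawler loop over an in-memory "web":
// grab a page, store it, enqueue its outgoing URLs, repeat.
public class CrawlLoop {
    public static List<String> crawl(String seed, Map<String, List<String>> links) {
        Deque<String> queue = new ArrayDeque<>();   // waiting queue of URLs
        Set<String> visited = new HashSet<>();      // URLs already grabbed
        List<String> grabbed = new ArrayList<>();   // order in which pages are stored
        queue.add(seed);
        while (!queue.isEmpty()) {
            String url = queue.poll();              // step 4: one URL at a time
            if (!visited.add(url)) continue;        // skip duplicates
            grabbed.add(url);                       // step 2: grab and store the page
            for (String out : links.getOrDefault(url, List.of()))
                if (!visited.contains(out)) queue.add(out);  // step 3: enqueue its URLs
        }
        return grabbed;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = Map.of(
            "index", List.of("a", "b"),
            "a", List.of("b", "c"));
        System.out.println(crawl("index", links)); // [index, a, b, c]
    }
}
```

A real crawler replaces the map lookup with an HTTP fetch and limits the crawling depth, as Section 4 describes.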
A crawler is responsible for the following tasks:

Ranking of importance and popularity
There are a number of methods to measure the importance and popularity of an object or a person in a network. For example, the PageRank mechanism measures the influence and popularity of a webpage (Page et al., 1999), and the Erdős collaboration network (Erdős Number Project, 2010) may be used to measure the impact of collaborators (direct and indirect) of Erdős. These measures, however, do not explicitly consider the effect of time in their ranking. To measure the importance and popularity of news items, we need to consider time explicitly. This calls for a new measure, and we present PopuRank to fill this gap.

KWB Architecture
KWB consists of five components (see Fig. 1): (1) crawlers, (2) central DB, (3) summary engine, (4) core processing unit, and (5) web display. Given below are brief descriptions of these components:
1. The crawler component is responsible for collecting news items around the clock from over 120 news websites in mainland China.
2. The central DB is responsible for processing the raw data collected from the crawlers, including removing duplicated news items and fetching summaries for each news article.
3. The summary engine is responsible for returning summaries of each news article with the different lengths required by applications. This is proprietary technology.

KWB Crawler Framework
The KWB crawler in our system follows the framework of vertical crawling. It can be reused and customized according to the specific layout of a webpage. We observe that news websites tend to have the same structure: an index page and a number of content pages for news items. When grabbing the index page, we may want to set the crawling depth to 1 to stop the crawler from grabbing the URLs contained in the content pages. Meanwhile, we also want to remove repeated URLs from the URL queue. The KWB crawler framework uses both specific rules and common rules, depending on the individual crawler for a given website.
The KWB crawler framework consists of the following modules (see Fig. 3):
1. Visual input module. This module allows the user to specify the pattern of the target webpage's layout. The user may specify two kinds of patterns. The first kind is a regular expression representing the content the user wants to extract; for example, a regular expression that matches the opening and closing pair of a specific HTML tag, within which is the content the user wants to extract. The second kind is an XPath structure of the content that the user wants to extract; for example, if the user wants to select the content enclosed in all instances of a given tag, the user can specify an XPath query that selects those nodes.
2. Webpage rule management. It manages the webpage rules entered by users, including the following operations: deleting, checking, and updating.
3. The core crawler cluster. This cluster consists of the following components:
(1) Thread pool. It is the set of threads in a multitask system.
(2) URL pool. It is the database of all pending URLs extracted from grabbed pages. We use a Bloom filter to detect duplicate URLs and remove them. The crawler visits and removes one URL at a time from the remaining URLs in this pool.
(3) Pattern pool. It is the database of all the webpage rules entered by users.
(4) DAO module. The DAO (data access object) contains the interface for further operations, including data export and the data interface.
(5) Duplicate removal. It removes duplicate URLs in the URL pool and duplicate patterns in the pattern pool.
4. The scheduling module. This module consists of the following submodules:
(1) Priority processing. Some websites are updated more frequently than others. This submodule determines which sites need more frequent visits.
(2) Temp grab. Sometimes the user just wants to fetch a website once without paying a return visit. This submodule handles this type of crawling.
(3) Regular grab. For most websites, the user sets up a schedule to grab them periodically. This submodule handles this type of crawling.
5. The supervision module. This module consists of the following submodules:
(1) Resource control (proxy/account). It is a pool containing all the proxy and account information. Proxies are used to avoid IP-blocking problems, and accounts are used to log on to websites that require signing in, such as Twitter and Facebook.
(2) Monitoring. It monitors whether the crawler is functioning normally. For example, it monitors whether the target website has blocked the crawler.
(3) Anti-blocking. When the monitoring submodule detects that a crawler is blocked, it decides whether to restart the crawler, change the pattern, or change the proxy to avoid blocking.
(4) Managing anti-blocking, exception, and restore rules. This submodule allows the user to manage and change the rule patterns of a website. It also determines how often to test whether a crawler is still functioning normally.
6. The program entrance. This component consists of a crawler controller/entrance submodule, which is responsible for starting the entire system.
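The URL pool's duplicate detection described above can be illustrated with a toy Bloom filter. The filter size, the number of hash positions, and the hashing scheme below are illustrative choices, not the production configuration:

```java
import java.util.BitSet;

// Toy Bloom filter for duplicate-URL detection: each URL sets k bit
// positions in a fixed-size bit array. False positives are possible
// (a new URL may look seen); false negatives are not.
public class UrlBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int k;

    public UrlBloomFilter(int size, int k) {
        this.bits = new BitSet(size);
        this.size = size;
        this.k = k;
    }

    // Derive k positions from two base hashes (double hashing).
    private int position(String url, int i) {
        int h1 = url.hashCode();
        int h2 = (h1 >>> 16) | 1;              // force an odd second hash
        return Math.floorMod(h1 + i * h2, size);
    }

    public boolean mightContain(String url) {
        for (int i = 0; i < k; i++)
            if (!bits.get(position(url, i))) return false;
        return true;
    }

    // Returns true if the URL was new (and records it), false if seen.
    public boolean addIfNew(String url) {
        if (mightContain(url)) return false;
        for (int i = 0; i < k; i++) bits.set(position(url, i));
        return true;
    }

    public static void main(String[] args) {
        UrlBloomFilter f = new UrlBloomFilter(1 << 16, 3);
        System.out.println(f.addIfNew("http://example.com/a")); // true: first sight
        System.out.println(f.addIfNew("http://example.com/a")); // false: duplicate
    }
}
```

The bit array keeps memory bounded no matter how many URLs are seen, which is why Bloom filters suit a long-running crawler; the occasional false positive only means a URL is skipped, never fetched twice.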
We implemented the KWB crawler framework in Java. We use httpclient to connect to a website and obtain the DOM tree of the page. We use CSS selectors and Jsoup to parse and extract content. We implemented the DAO using MySQL and JDBC.
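The first pattern kind of the visual input module (a user-supplied regular expression around an HTML tag) can be sketched with the standard library alone; the `<p>` tag and the page snippet are hypothetical, and the real system parses pages with Jsoup rather than raw regular expressions:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Extract the text enclosed by a given HTML tag, as in the visual
// input module's regular-expression pattern kind.
public class TagExtractor {
    public static List<String> extract(String html, String tag) {
        // Non-greedy group between the opening and closing tag pair.
        Pattern p = Pattern.compile("<" + tag + "[^>]*>(.*?)</" + tag + ">",
                                    Pattern.DOTALL);
        Matcher m = p.matcher(html);
        List<String> out = new ArrayList<>();
        while (m.find()) out.add(m.group(1).trim());
        return out;
    }

    public static void main(String[] args) {
        String page = "<html><p>first item</p><div>skip</div><p>second item</p></html>";
        System.out.println(extract(page, "p")); // [first item, second item]
    }
}
```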

Central Database
Data collected from the KWB crawler are raw data. Although duplicate URLs are eliminated by the crawler, the same news article may be collected from different URLs because it may be reposted on different websites. For each news article we need to retrieve summaries of different lengths (depending on applications) using a proprietary Chinese text summary engine. These two processes are time consuming. To reduce computation, we create a database called the central DB (see Fig. 4) to remove duplicates and retrieve summaries for the raw data collected every hour. There are two different types of duplicates in the raw data: (1) exactly the same news items due to reposting; (2) different news items reporting the same news. We keep the second type of news items, for they report the same event from different angles, which is useful. To identify the first type of duplicates we could compute cosine similarities over all the raw data collected by the KWB crawlers, but this approach is time consuming. Instead, we take a greedy approach: we reduce the number of news items for which we need to retrieve summaries by eliminating duplicates posted within a small time window. We further remove duplicates later, before computing news classifications.
The central DB retrieves article summaries and detects duplicates in a parallel fashion. In particular, it sorts all the unprocessed raw data in increasing order of their IDs. These are incremental IDs given to the news items based on the time they are fetched by the KWB crawler framework. Starting from the first news article, repeat the following:
1. Send a request to the summary engine to retrieve summaries of the required lengths.
2. Compute the cosine similarities of the article with the news items whose IDs fall in a small fixed time window after this article. If a duplicate is found, remove the one whose ID is in the time window (i.e., with the larger ID), for it is likely a reposting, and the news article with the smaller ID may already have had its summaries generated by the summary engine running on a different server.
3. Move to the next news article in the sorted list.
The index of the news items stored in the central DB contains, among other things, the following four fields: news title, news URL, image URL, and the first and last sentence of the news content. We further remove news items that match on these fields: for each pair of news items, if there is a match on any of the four fields, we remove the article with the larger ID.
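The windowed duplicate check of step 2 can be sketched as follows, using bag-of-words cosine similarity over already-segmented articles; the window size and the 0.9 threshold here are illustrative, not the system's actual settings:

```java
import java.util.*;

// Windowed near-duplicate removal: compare each article only with the
// articles whose IDs fall in a small window after it, and drop the
// later (larger-ID) copy when cosine similarity exceeds a threshold.
public class WindowDedup {
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            na += (double) e.getValue() * e.getValue();
            dot += (double) e.getValue() * b.getOrDefault(e.getKey(), 0);
        }
        for (int v : b.values()) nb += (double) v * v;
        return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // articles: ID-sorted list of bag-of-words term counts; returns kept IDs.
    static List<Integer> dedup(List<Map<String, Integer>> articles,
                               int window, double threshold) {
        boolean[] dropped = new boolean[articles.size()];
        List<Integer> keep = new ArrayList<>();
        for (int i = 0; i < articles.size(); i++) {
            if (dropped[i]) continue;
            keep.add(i);
            for (int j = i + 1; j < Math.min(i + 1 + window, articles.size()); j++)
                if (!dropped[j] && cosine(articles.get(i), articles.get(j)) >= threshold)
                    dropped[j] = true;   // larger ID: likely a reposting
        }
        return keep;
    }

    public static void main(String[] args) {
        List<Map<String, Integer>> arts = List.of(
            Map.of("quake", 3, "city", 1),
            Map.of("quake", 3, "city", 1),      // exact repost of article 0
            Map.of("market", 2, "stock", 2));
        System.out.println(dedup(arts, 5, 0.9)); // [0, 2]
    }
}
```

Restricting comparisons to a fixed-size window makes the cost linear in the number of articles rather than quadratic, which is the point of the greedy approach.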

PopuRank
KWB implements a Labeled-LDA classifier to classify all the news items stored in the central DB. To do so, it needs to segment each news article into a sequence of words, where a word is a sequence of Chinese characters. We show that using Labeled-LDA achieves higher classification accuracy than SVM (Support Vector Machines) for Chinese news items, and we will report this work in a separate paper.
KWB then determines the popularity ranking, called PopuRank, of news items. We observe that the news items that appear popular during crawling are indeed the truly popular news. In particular, in a given time period, breaking news is quickly reported and reposted everywhere online. In this case, the term frequency (TF) of certain words describing this news increases sharply. Meanwhile, the document frequency (DF) of certain words describing the breaking news also increases. We monitor each word (except stop words) in each time frame every day. By monitoring the TF and DF fluctuations of words, KWB calculates the PopuRank of the news items collected in each time unit $u$. A news item with a higher PopuRank is more popular. The time unit $u$ may be changed according to actual needs and user interests. For example, if we want to determine popular news items in each hour, then we may set $u$ to be one hour. The PopuRank of each article remains valid for a fixed number $\ell$ of time frames. For example, we may let $\ell = 24$ or $48$ when $u$ is one hour. The value of $\ell$ may also be changed.
Let $t_v$ denote the current time frame. Let $D_v$ denote the corpus of all news items collected in this time frame with duplicates removed, where $D_i$ is a news article and $D_i$ contains $N_i$ words in the bag-of-words model, denoted by $D_i = \{w_1, w_2, \ldots, w_{N_i}\}$, where each word is a segment of two or more Chinese characters after segmentation. We define the following terms:
1. Term frequency (TF). The term frequency of word $w_j$ in $D_i$ in time frame $t_v$, denoted by $tf(w_j, D_i, t_v)$, is the number of times it appears in $D_i$, denoted by $N_{ij}$, divided by $N_i$. That is,
$$tf(w_j, D_i, t_v) = \frac{N_{ij}}{N_i}.$$
2. Document frequency (DF). The document frequency of word $w_j$ in the corpus $D_v$, denoted by $df(w_j, D_v)$, is defined as the total number of documents in $D_v$ that contain $w_j$, denoted by $N_j$, divided by the total number of words in $D_v$, denoted by $N$. That is,
$$df(w_j, D_v) = \frac{N_j}{N}.$$
3. Average term frequency (ATF). Let $atf(w_j, D_v)$ denote the average term frequency of word $w_j$ in corpus $D_v$. That is,
$$atf(w_j, D_v) = \frac{1}{|D_v|} \sum_{D_i \in D_v} tf(w_j, D_i, t_v).$$
4. Term rank (TR). We define the term rank of word $w_j$ in document $D_i$ in time frame $t_v$, denoted by $tr(w_j, D_i, t_v)$, as follows:
$$tr(w_j, D_i, t_v) = \alpha \cdot tf(w_j, D_i, t_v) + \beta \cdot df(w_j, D_v),$$
where $\alpha \geq 0$, $\beta \geq 0$, and $\alpha + \beta = 1$. For example, we may let $\alpha = 0.6$ and $\beta = 0.4$ to indicate that we place more weight on term frequency than on document frequency.
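These statistics can be sketched over a toy segmented corpus. The sketch assumes ATF is the average of $tf$ over all documents in $D_v$, and the two two-character "documents" are made up for illustration:

```java
import java.util.*;
import java.util.stream.IntStream;

// Term statistics for one time frame: tf = N_ij / N_i, df = N_j / N
// (documents containing w divided by the total word count N, per the
// paper's definition), and term rank tr = alpha * tf + beta * df.
public class TermStats {
    final List<List<String>> docs;   // segmented documents in D_v
    final int totalWords;            // N: total number of words in D_v

    TermStats(List<List<String>> docs) {
        this.docs = docs;
        this.totalWords = docs.stream().mapToInt(List::size).sum();
    }

    double tf(String w, int i) {     // N_ij / N_i
        long nij = docs.get(i).stream().filter(w::equals).count();
        return (double) nij / docs.get(i).size();
    }

    double df(String w) {            // N_j / N
        long nj = docs.stream().filter(d -> d.contains(w)).count();
        return (double) nj / totalWords;
    }

    double atf(String w) {           // average tf over all documents in D_v
        return IntStream.range(0, docs.size())
                        .mapToDouble(i -> tf(w, i)).average().orElse(0);
    }

    double tr(String w, int i, double alpha, double beta) {
        return alpha * tf(w, i) + beta * df(w);
    }

    public static void main(String[] args) {
        TermStats s = new TermStats(List.of(
            List.of("地震", "救援", "地震"),
            List.of("股市", "救援")));
        System.out.println(s.tf("地震", 0));           // 2/3
        System.out.println(s.df("救援"));              // 2/5
        System.out.println(s.tr("地震", 0, 0.6, 0.4)); // 0.6*(2/3) + 0.4*(1/5)
    }
}
```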
For each word $w_j$ appearing in $D_v$, compute $df(w_j, D_v)$ and $atf(w_j, D_v)$, and keep them for $\ell$ time frames, where $\ell$ is the number of time frames during which a PopuRank remains valid. We now define the PopuRank of a document. Assume that word $w_j$ appears in the current time frame $t_v$. Let $T$ denote the following sequence of $\ell$ consecutive time frames, called a window:
$$T = \langle t_{v-\ell}, t_{v-\ell+1}, \ldots, t_{v-1} \rangle.$$
At each time frame in this window, we monitor the DF and ATF values of each word. Let $t_v$ be the current time frame. For each word $w_j$ in $D_v$, we have the following two cases:
Case 1: $w_j$ is a new word; that is, it did not appear in the previous time frames in the window $T$. Then we compute the TF-IDF values of all the new words in this time frame and mark the top $d$ percent of the new words as popular words.
Case 2: $w_j$ is not a new word. Compute $atf(w_j, t_v)$ and $df(w_j, t_v)$. If the ATF and DF values of word $w_j$ at time $t_v$ suddenly increase $k_1$ and $k_2$ times over the previous average ATF and DF values of word $w_j$, denoted by $avgATF(w_j, t_v)$ and $avgDF(w_j, t_v)$, then we consider the word $w_j$ a popular word, where
$$avgATF(w_j, t_v) = \frac{1}{\ell} \sum_{u=v-\ell}^{v-1} atf(w_j, t_u), \qquad avgDF(w_j, t_v) = \frac{1}{\ell} \sum_{u=v-\ell}^{v-1} df(w_j, t_u).$$
To specify the values of $k_1$ and $k_2$, let
$$k_1 = \frac{atf(w_j, t_v)}{avgATF(w_j, t_v)}, \qquad k_2 = \frac{df(w_j, t_v)}{avgDF(w_j, t_v)}.$$
If $k_1 \geq \delta$ and $k_2 \geq \sigma$, where $\delta$ and $\sigma$ are threshold values, then we say that word $w_j$ is popular in time frame $t_v$.
Let $H_v$ denote the set of all popular words in time frame $t_v$. We define the PopuRank of news article $D_i \in D_v$ to be the sum of the term ranks of the popular words in $D_i$ in time frame $t_v$. Namely,
$$PopuRank(D_i, t_v) = \sum_{w_j \in H_v \cap D_i} tr(w_j, D_i, t_v).$$
Parameters $\alpha$ and $\beta$ affect TR and hence PopuRank. The values of $\alpha$ and $\beta$ are decided by which quantity, TF or DF, is regarded as more important. Fig. 7 shows the TR of a particular word under different values of $\alpha$. Meanwhile, since TR varies, the PopuRank of a news item also varies; Fig. 8 shows the PopuRank of one news item under different values of $\alpha$ and $\beta$ in the same time frame.
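Given the popular-word set $H_v$ and the term ranks, the PopuRank sum itself is straightforward; in the sketch below the words and term-rank values are made up for illustration:

```java
import java.util.*;

// PopuRank of an article: the sum of the term ranks of the popular
// words (members of H_v) that occur in the article's bag of words.
public class PopuRank {
    static double popuRank(Set<String> articleWords,
                           Set<String> popularWords,          // H_v
                           Map<String, Double> termRank) {    // tr(w, D_i, t_v)
        double rank = 0;
        for (String w : articleWords)
            if (popularWords.contains(w))
                rank += termRank.getOrDefault(w, 0.0);
        return rank;
    }

    public static void main(String[] args) {
        Set<String> hv = Set.of("地震", "救援");              // popular words in t_v
        Map<String, Double> tr = Map.of("地震", 0.48, "救援", 0.30, "股市", 0.10);
        double r = popuRank(Set.of("地震", "救援", "城市"), hv, tr);
        System.out.println(r); // 0.48 + 0.30
    }
}
```

Because only popular words contribute, articles about the breaking events of the current time frame naturally rise to the top of each category.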
Thresholds $\delta$ and $\sigma$ decide the number of popular words. Fig. 9 shows that the number of popular words decreases as $\delta$ and $\sigma$ increase ($\delta$ and $\sigma$ are set to the same value in Fig. 9).
The running time of calculating PopuRank on the news items in each time frame depends on the number of news items waiting to be processed. Table 1 shows the number of news items in each time frame on an average day, and the time to compute the PopuRank of all news items in each time frame on a server running a QEMU Virtual CPU (version 1.2.0) at 2.6 GHz with 16 GB RAM.

Web Displays of KWB
KWB is an automated quick news system that collects news items in real time from all major Chinese news websites, classifies the news items into 19 categories, and displays on http://www.kuaiwenbao.com the news items in each category with summaries and pictures, sorted according to their PopuRank values. We have also implemented KWB as mobile apps (the Android app may be downloaded by entering http://www.kuaiwenbao.com/kuaiwenbao.apk in a web browser on an Android phone). Fig. 10 depicts the web display of KWB, where the left-hand panel is a menu bar of news titles and picture thumbnails. The user simply points the mouse at a particular news title to see the original picture and the summary of the news item on the right-hand panel. The reader may also click the "read the original" button to go to the URL of the original news article and read it.
KWB classifies all news items into 19 categories. Users may click the menu icon in the upper-left corner to display the menu of categories and select a particular category of interest. Fig. 11 depicts the category menu.