What to Write? A topic recommender for journalists

In this paper we present a recommender system, What To Write and Why, capable of suggesting to a journalist, for a given event, the aspects still uncovered in news articles on which the readers focus their interest. The basic idea is to characterize an event according to the echo it receives in online news sources and associate it with the corresponding readers’ communicative and informative patterns, detected through the analysis of Twitter and Wikipedia, respectively. Our methodology temporally aligns the results of this analysis and recommends the concepts that emerge as topics of interest from Twitter and Wikipedia, either not covered or poorly covered in the published news articles.


Introduction
In a recent study on the use of social media sources by journalists (Knight, 2012) the author concludes that "social media are changing the way news are gathered and researched". In fact, a growing number of readers, viewers and listeners access online media for their news (Gloviczki, 2015). When readers feel involved in news stories they may react by trying to deepen their knowledge of the subject, and/or by comparing their opinions with those of their peers. Stories may then solicit a reader's information and communication needs. The intensity and nature of both needs can be measured on the web, by tracking the impact of news on users' search behavior on on-line knowledge bases as well as their discussions on popular social platforms. What is more, the on-line public's reaction to news is almost immediate (Leskovec et al., 2009) and even anticipated, as in the case of planned media events and performances, or of disasters (Lehmann et al., 2012). Assessing the focus, duration and outcomes of news stories on public attention is paramount for both public bodies and media in order to determine the issues around which public opinion forms and how those issues are framed (i.e., how they are being considered) (Brooker and Schaefer, 2005). Furthermore, real-time analysis of public reaction to news items may provide useful feedback to journalists, such as highlighting aspects of a story that need to be further addressed, issues that appear to be of interest to the public but have been ignored, or even helping local newspapers echo international press releases.
The aim of this paper is to present a news media recommender, What to Write and Why (W 3 ), for analyzing the impact of news stories on the readers, and finding aspects, still uncovered in news articles, on which the public has focused its interest. The purpose of W 3 is to support journalists in the task of reshaping and extending their coverage of breaking news, by suggesting topics to address when following up on such news items. For example, we have found that a common pattern for news readers is to search Wikipedia for past events of the same type, which is not surprising per se. However, among the many possible similar events, our system is able to identify those that the majority of readers consider (sometimes surprisingly) highly associated with breaking news, e.g., searching for the 2013 CeaseFire program in Baltimore during Egypt's ceasefire proposal in Gaza in July 2014.

Methodology
Our methodology consists of five steps, as shown in the workflow of Figure 1.
Step 1. Event detection: We use SAX*, an unsupervised temporal mining algorithm that we introduced in (Stilo and Velardi, 2016), to cluster tokens (words, entities, hashtags, page views) based on the shape similarity of their associated signals s(t). In SAX*, signals observed in temporal windows L k are first transformed into strings of symbols of an alphabet Σ; next, strings associated with active tokens (those corresponding to patterns of public attention) are clustered based on their similarity. Each cluster is interpreted as related to an event e i . Clusters are extracted independently from on-line news (N ), Twitter messages (T ) and Wikipedia page views (W ). For example, the cluster in Figure 2 shows Wikipedia page views related to the Malaysia Airlines crash in July 2014. We remark that SAX* blindly clusters signals without prior knowledge of the event and its occurrence date; furthermore, it avoids time-consuming processing of text strings, since it only considers active tokens.
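The symbolic transformation in Step 1 can be illustrated with a minimal sketch of classic SAX discretization. This is not the SAX* implementation of (Stilo and Velardi, 2016): the function name `sax_string`, the breakpoint table and the toy signal are assumptions made for illustration only.

```python
import statistics

# Breakpoints that split a standard normal distribution into equiprobable
# regions, as in classic SAX. Only small alphabets are covered (assumption).
BREAKPOINTS = {2: [0.0], 3: [-0.43, 0.43], 4: [-0.67, 0.0, 0.67]}


def sax_string(signal, n_segments, alphabet="ab"):
    """Turn a numeric signal into a string of symbols (classic SAX sketch).

    Assumes n_segments <= len(signal) and len(alphabet) in BREAKPOINTS.
    """
    mean = statistics.fmean(signal)
    std = statistics.pstdev(signal)
    # z-normalise the signal; a flat signal maps to all zeros
    z = [(x - mean) / std if std > 0 else 0.0 for x in signal]
    n = len(z)
    # piecewise aggregate approximation: average each of n_segments chunks
    paa = []
    for k in range(n_segments):
        start, end = k * n // n_segments, (k + 1) * n // n_segments
        paa.append(statistics.fmean(z[start:end]))
    cuts = BREAKPOINTS[len(alphabet)]
    # symbol index = number of breakpoints lying below the segment mean
    return "".join(alphabet[sum(v > c for c in cuts)] for v in paa)


# Hypothetical daily counts for one token, with a short burst of attention
print(sax_string([1, 1, 1, 1, 9, 9, 1, 1], n_segments=8))
```

Tokens whose strings have a similar shape (for instance, a burst of `b` symbols at the same position) would then be candidates for the same cluster.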
Step 2. Intra-source clustering: Since clusters are generated in sliding windows L k of equal length L and temporal increment ∆, clusters referring to the same event but extracted in partly overlapping windows may slightly differ, especially for long-lasting events, when news updates motivate the emergence of new sub-topics and the decay of others. An example is in Figure 3, showing for simplicity a cluster with a unique signal s(t), which we can also interpret as the cluster centroid. The figure also shows the string of symbols (over the alphabet {a, b}) obtained for the signal in each window. For a better characterization of an event, we merge clusters referring to the same event and extracted in adjacent windows, based on their similarity. Merged clusters form meta-clusters, denoted with m S i , where the index i refers to the event and S ∈ {N, T, W } to the data source. With reference to Figure 3, the signals in windows 1, 2, 3 and 4 would be merged, but not the signal in window 5. An example from the T dataset is shown in Table 1: note that the first two clusters show that initially Twitter users were concerned mainly about the tragedy (clusters C9 and C5), and only later did their interest focus on political aspects (e.g., Barack Obama, Vladimir Putin in C17 and C18).
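The merging of clusters from adjacent windows can be sketched as follows. The similarity measure and threshold used in Step 2 are not specified here, so Jaccard overlap with a threshold of 0.5 is an assumption, as are the function names and the toy tokens.

```python
def jaccard(a, b):
    """Jaccard overlap between two collections of tokens."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def merge_adjacent(window_clusters, threshold=0.5):
    """Greedily merge clusters extracted in consecutive windows.

    window_clusters: one token set per consecutive window (hypothetical input).
    A cluster joins the current meta-cluster when it is similar enough to the
    cluster from the previous window; otherwise a new meta-cluster starts.
    """
    meta = []
    for k, cluster in enumerate(window_clusters):
        cluster = set(cluster)
        if meta and jaccard(meta[-1]["last"], cluster) >= threshold:
            meta[-1]["windows"].append(k)
            meta[-1]["tokens"] |= cluster
            meta[-1]["last"] = cluster
        else:
            meta.append({"windows": [k], "tokens": set(cluster), "last": cluster})
    return meta


# Toy example: sub-topics drift across three windows, then the topic changes
windows = [
    {"crash", "mh17", "ukraine"},
    {"crash", "mh17", "ukraine", "putin"},
    {"mh17", "ukraine", "putin", "obama"},
    {"worldcup", "final"},
]
for m in merge_adjacent(windows):
    print(m["windows"], sorted(m["tokens"]))
```

Comparing each cluster only with its immediate predecessor lets a meta-cluster follow gradual topic drift, at the cost of possibly chaining clusters whose first and last members share few tokens.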
Step 3. Inter-source alignment: Next, an alignment algorithm explores possible matches across the three data sources N , T and W . For any event e i , we eventually obtain three "aligned" meta-clusters m N i , m T i and m W i mirroring respectively the media coverage of the considered event and its impact on readers' communication and information needs.
Step 4. Generating a recommendation: The input to our recommender is the news meta-clusters m N i related to an event e i first reported on day d 0 and extracted during an interval I : where d 0+x is the day on which the query is performed by the journalist. The system compares the three aligned meta-clusters and generates two sets of recommendations, R in news i and R novel i , representing, respectively, the event-related topics already discussed, and those not yet considered in news items. The first set is interesting for journalists in order to understand which topics mostly attracted the attention of the public, while the second set includes event-related, but still uncovered, topics that W 3 recommends discussing. For example, the following is a recommendation generated from the analysis of Wikipedia page views, related to the Scottish independence referendum on September 17th, 2014: [scotland, wales, alex salmond, united kingdom, scottish national party, flag of scotland, william wallace, countries of the united kingdom, mary queen of scots, tony blair, braveheart, flag of the united kingdom, republic of ireland].
When comparing these entities with the aligned news meta-clusters, the set of novel entities R novel i is: [flag of scotland, william wallace, countries of the united kingdom, mary queen of scots, tony blair, braveheart] and all the others are also found in news.
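The split into entities already covered in news and novel entities amounts to a membership check against the aligned news meta-cluster. In the sketch below the recommended list is the Wikipedia example above, while `news_entities` contains only the entities stated to be found in news (the actual news meta-cluster is larger); the function name is hypothetical.

```python
def split_recommendations(ranked_entities, news_entities):
    """Split a ranked recommendation into in-news and novel entities,
    preserving the original ranking in both lists."""
    in_news = [e for e in ranked_entities if e in news_entities]
    novel = [e for e in ranked_entities if e not in news_entities]
    return in_news, novel


recommended = [
    "scotland", "wales", "alex salmond", "united kingdom",
    "scottish national party", "flag of scotland", "william wallace",
    "countries of the united kingdom", "mary queen of scots", "tony blair",
    "braveheart", "flag of the united kingdom", "republic of ireland",
]
# Subset of the news meta-cluster implied by the example (illustrative only)
news_entities = {
    "scotland", "wales", "alex salmond", "united kingdom",
    "scottish national party", "flag of the united kingdom",
    "republic of ireland",
}

r_in_news, r_novel = split_recommendations(recommended, news_entities)
print(r_novel)
```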
Step 5. Classification of information and communication needs: In addition to recommendations, we automatically assign a category both to event clusters m N i in news, and to related entities in Twitter and Wikipedia aligned metaclusters m T i and m W i , in order to detect recurrent discussion topics and search patterns in relation to specific event types. To do so, we exploit both BabelNet (Navigli and Ponzetto, 2010

Discussion
To conduct our study, we created three datasets: Wikipedia PageViews (W), On-line News (N) and Twitter messages (T). Data were collected over four months, from June 1st to September 30th, 2014. Table 2 shows some statistics. Note that Wikipedia clusters are smaller, since cluster members are only named entities (page views). We defined the following evaluation framework:
i) Given an event e i and related news n i ∈ N i , we generate recommendations as explained in Step 4, in a selected interval prior to the day of the query.
ii) Automated evaluation: we select the top K scored recommendations and measure the saliency of R in news i and the serendipity of R novel i in an automated fashion, and we compare the performance against a primitive recommender, in analogy with (Murakami et al., 2008) and (Ge et al., 2010).
iii) Manual evaluation: we select the top K scored recommendations in R novel i for a restricted number of 21 high-impact world-wide events, and we perform manual evaluation using the CrowdFlower.com platform, providing detailed evaluation guidelines for human annotators. Using this ground truth, we measure the global serendipity of W 3 recommendations.

Automated Evaluation
We first build two primitive recommenders (PRs) for Wikipedia and Twitter, which we use as a baseline. The input to a PR is the same as for W 3 (see Step 4).
Wikipedia PR: The Wikipedia PR is based on finding connected components of the Wikipedia hyperlink page graph (as in (Hu et al., 2009)), when considering only the topmost visited pages in a temporal slot. More precisely, for each day d in the interval I : 3 , we select the top H ≥ K visited named entities of the day E W d . Entities are ranked by frequency of page views 4 . Next, we create clusters c d j obtained by extracting the connected components of E W d in the Wikipedia hyperlink graph. Let C I be the set of all clusters c I j in I . From this set, we select the top r clusters based on the Jaccard similarity with news meta-clusters m N i . A "primitive" recommendation for event e i on day d 0+x is the set P R W i of topmost K ranked entities in the r previously selected clusters. As with W 3 recommendations, P R W i is a ranked list of entities, some of which are also found in m N i , while others are novel.
Twitter PR: For each entity e ∈ m N i we retrieve and recommend the top K co-occurring entities in tweets in the considered interval.
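The Wikipedia primitive recommender described above can be sketched for a single day as follows. The sketch treats hyperlinks as undirected and ranks the final entities by page views; both choices, along with the function names and the toy data, are assumptions rather than details given in the text.

```python
from collections import defaultdict


def connected_components(nodes, links):
    """Connected components of the link graph restricted to `nodes`."""
    nodes = set(nodes)
    adj = defaultdict(set)
    for u, v in links:
        if u in nodes and v in nodes:
            adj[u].add(v)
            adj[v].add(u)
    seen, comps = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            x = stack.pop()
            if x in seen:
                continue
            seen.add(x)
            comp.add(x)
            stack.extend(adj[x] - seen)
        comps.append(comp)
    return comps


def primitive_wikipedia_pr(views, links, news_cluster, H, r, K):
    """Top-K entities from the r components most similar to the news cluster."""
    top = sorted(views, key=views.get, reverse=True)[:H]
    comps = connected_components(top, links)
    news = set(news_cluster)

    def jaccard(c):
        return len(c & news) / len(c | news)

    best = sorted(comps, key=jaccard, reverse=True)[:r]
    pool = set().union(*best) if best else set()
    return sorted(pool, key=views.get, reverse=True)[:K]


# Toy page-view counts and hyperlinks (made up for illustration)
views = {"mh17": 900, "ukraine": 800, "boeing 777": 700,
         "brazil": 600, "fifa": 500, "minor page": 10}
links = [("mh17", "ukraine"), ("mh17", "boeing 777"), ("brazil", "fifa")]
news = {"mh17", "ukraine", "malaysia airlines"}
print(primitive_wikipedia_pr(views, links, news, H=5, r=1, K=3))
```

In this toy run "boeing 777" would be the novel entity, since it is linked to pages found in the news cluster but does not itself appear there.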
Note that both primitive recommenders are far from being naive. A hyperlink graph to characterize users' intent in Wikipedia search is used in (Hu et al., 2009) (although the authors use Random Walks rather than connected components analysis to identify related pages). Co-occurrences with top ranked news terms have been used in (Weiler et al., 2014) to track on Twitter the evolution and the context around events. We generate recommendations using four systems: W 3 (T ), W 3 (W ), P R(T ) and P R(W ). The first two originate from What To Write and Why when applied to Twitter and Wikipedia, respectively. The last two are generated by the two primitive recommenders described above. For all systems, we consider the first K top ranked entities.
3 Since rumors on an event can be anticipated with respect to the day d 0 on which the first news item is published.
4 Note that E W d could be used directly for recommendation; however, it would be an excessively rough strategy.
To assess the quality of "not novel" recommended entities in W 3 (and similarly for the other systems), for any r j ∈ R in news i we retrieve all the news N i related to m N i meta-clusters, and compute the saliency of r j as follows: where n i ∈ N i , occ title (r j , n i ) is the number of occurrences of r j in the title of n i , while occ snip (r j , n i ) is the number of occurrences of r j in the text snippet of n i and β has been experimentally set to 0.7. The intuition is that recommended entities in R in news i are salient if they frequently occur in the title and text of news snippets, where occurrences in the title have a higher weight. The total saliency of r j is then: where IDF (r j ) is the inverse document frequency of r j in all news of the considered temporal slot, and is used to smooth the relevance of terms with high probability of occurrence in all documents. The average saliency of R in news i is: To provide an estimate of the serendipity of novel recommendations, we compute the NASARI similarity (Camacho-Collados et al., 2016) of entities r k ∈ R novel i with in-news entities r j ∈ E N i and we weight these values with the saliency of r j .
The intuition is that serendipitous recommendations are those concerning topics which have not been discussed so far in on-line news, but are highly semantically related with highly salient topics in news: Note that this formulation is not conceptually different from other measures used in literature (e.g., (Tran et al., 2015), (Murakami et al., 2008)), which commonly assign a value to novel recommendations proportionally to their relevance and informativeness. However, given the absence of prior knowledge on users' choices, we assume that semantic similarity with salient entities in news items is a clue for relevance.
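One set of formulas consistent with the verbal description of saliency and serendipity is sketched below. It is offered as an assumption, not as the exact definitions: the symbol names sal and ser, the convex combination with weight β, the multiplication by IDF and the normalization of the serendipity score are all illustrative choices.

```latex
% Assumed forms; symbol names and normalizations are illustrative.
\begin{align*}
\mathrm{sal}(r_j, n_i) &= \beta \, occ_{title}(r_j, n_i) + (1-\beta) \, occ_{snip}(r_j, n_i),
  \qquad \beta = 0.7 \\
\mathrm{sal}(r_j) &= IDF(r_j) \sum_{n_i \in N_i} \mathrm{sal}(r_j, n_i) \\
\overline{\mathrm{sal}}\left(R^{in\,news}_i\right) &=
  \frac{1}{\left|R^{in\,news}_i\right|} \sum_{r_j \in R^{in\,news}_i} \mathrm{sal}(r_j) \\
\mathrm{ser}(r_k) &=
  \frac{\sum_{r_j \in E^{N}_i} \mathrm{NASARI}(r_k, r_j) \, \mathrm{sal}(r_j)}
       {\sum_{r_j \in E^{N}_i} \mathrm{sal}(r_j)},
  \qquad r_k \in R^{novel}_i
\end{align*}
```

With β = 0.7, one occurrence in a title contributes 0.7 and one occurrence in a snippet contributes 0.3, which matches the stated intuition that title occurrences carry a higher weight.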
In Table 3 we summarize the results of our experiments, which we ran over the full dataset (see Table 2). We set the maximum number of provided recommendations to K = 10 for Wikipedia (where clusters are smaller) and K = 50 for Twitter. All recommendations are gathered either on the same day (d 0 ) as the first news item on the event e i , or two days after (d 2 = d 0 + 2).
In analogy with (Murakami et al., 2008) and (Ge et al., 2010), we show the percentage difference in performance between W 3 and Primitive Recommenders (PRs). Besides saliency and serendipity, we also compute the harmonic mean between the two (the F value). The Table shows that for Wikipedia, W 3 outperforms the PR both in saliency and serendipity (it is up to 656% more serendipitous than the baseline), while in Twitter, W 3 shows better serendipity (+91%) but lower saliency (-28%). Comparatively, the performance of W 3 is much better on Wikipedia than on Twitter, probably due to the limited evidence provided by the 1% available traffic. We also noted that two days after the main event (x=2), both serendipity and saliency only slightly decrease, showing that newswires have covered only a small portion of users' communication and information needs.
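The two derived quantities reported above, the F value and the percentage difference with respect to the baseline, can be computed as in the following sketch. Taking the difference relative to the primitive recommender's score is an assumption, and the input scores are made-up numbers, not values from Table 3.

```python
def harmonic_mean(saliency, serendipity):
    """F value: harmonic mean of saliency and serendipity."""
    total = saliency + serendipity
    return 2 * saliency * serendipity / total if total else 0.0


def percent_difference(system, baseline):
    """Relative difference of a system score with respect to the baseline, in %."""
    return 100.0 * (system - baseline) / baseline


# Hypothetical scores for a system and its primitive recommender
w3 = {"saliency": 0.72, "serendipity": 1.91}
pr = {"saliency": 1.00, "serendipity": 1.00}

for metric in ("saliency", "serendipity"):
    print(metric, round(percent_difference(w3[metric], pr[metric]), 1))
print("F", round(harmonic_mean(w3["saliency"], w3["serendipity"]), 3))
```

The harmonic mean penalizes a system that scores well on only one of the two measures, which is why it is a useful single-number summary here.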

Manual Evaluation
In manual evaluation, in order to start from a clean representation of each event for all systems, we selected 21 relevant events (those with the highest number of news items, tweets and Wikipedia views) in the considered 4-month period, and we manually identified the relevant news items N i for each event e i in a ± 1-day interval around the event peak day d 0 . An excerpt of 5 events is shown in Table 4. We then automatically extracted named entities from these news items.
For each of the four systems W 3 (T ), W 3 (W ), P R(T ) and P R(W ) and each event e i , we generate the first K = 5 novel recommendations, and we use the CrowdFlower.com platform to assess the relevance of these recommendations 5 . For each item of news, annotators are asked to decide if an entity IS or IS NOT relevant with reference to the reported news ("not sure" is also allowed). "Relevant" means that either the entity is semantically related to the domain of the news, or that it is factually related. The task was run on April 23rd, 2017, and we collected 1344 total judgements. To compute the performance of each system, we use the Mean Average Precision (MAP) 6 , which takes into account the rank of recommendations. The results are shown in Table 5, which shows, in agreement with the automated evaluation of Table 3, a superiority of W 3 and also confirms that the difference between W 3 and the primitive recommender is much higher in Wikipedia than in Twitter. We also note that the absolute performance of the recommender is higher in Twitter, which is not in contradiction with Table 3, since here we are focusing on world-wide high impact news, those for which our 1% Twitter stream provides sufficient evidence to obtain clean clusters, such as those in Table 1.
5 The saliency of R in news i is well assessed by formula (2).
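The following sketch shows one common way of computing Mean Average Precision over ranked recommendation lists. The exact variant used for Table 5 is not specified here, so normalizing by the number of relevant items in each list, and treating "not sure" judgements as not relevant, are assumptions; the judgement lists are toy data.

```python
def average_precision(relevance):
    """Average precision of one ranked list of binary relevance judgements.

    relevance[k] is True/1 if the item at rank k+1 was judged relevant.
    Normalized by the number of relevant items in the list (assumption).
    """
    hits, total = 0, 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / hits if hits else 0.0


def mean_average_precision(runs):
    """Mean of the average precision over all events (one ranked list each)."""
    return sum(average_precision(r) for r in runs) / len(runs)


# Two hypothetical events, K = 5 recommendations each
judgements = [
    [1, 0, 1, 0, 0],  # relevant items at ranks 1 and 3
    [0, 1, 0, 0, 0],  # relevant item at rank 2
]
print(round(mean_average_precision(judgements), 4))
```

Because precision is accumulated at the rank of each relevant item, a system that places relevant entities near the top of its list scores higher than one that returns the same entities further down.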

Analysis of Information Needs
To analyze readers' behavior more systematically, we classified event meta-clusters automatically, extending the work in (Košmerlj et al., 2015), where the authors have manually classified 13,883 Wikipedia event-related articles in 9 categories. Furthermore, we classified recommendations, i.e., tokens in m T i and m W i meta-clusters associated with each event e i , using BabelNet hypernymy (IS A) relations 7 , and their mapping onto Wikipedia Categories. In Figure 4 we plot the category distribution of Wikipedia articles (more specifically, we plot only novel recommendations extracted by W 3 ) that readers have accessed in connection with different event types. The bubble plot shows several interesting patterns: for example, Religion is the main searched category for events classified as Armed Conflicts and Attacks, mirroring the fact that religion is perceived as being highly related to the latest world-wide conflicts. Accordingly, users try to deepen their knowledge of these aspects. Disasters and accidents mostly include members in the same Wikipedia category (Disasters) and also Aircraft, since the Malaysia crash was the dominating event in the considered period. Business and Economy draw the attention of readers mostly when related to Technology, e.g., new devices being launched. Law and Crime events induce in readers the need to find out more about specific laws and treaties (the category Documents). Finally, we note that Sport is the event category showing the highest dispersion of information needs. While many of the bubbles in Figure 4 indeed show real information needs (e.g., VideoGames refers to the many sport games launched on the market, Model (person) refers to gossip about football players, and in general all people and media related categories refer to the participation of celebrities in sporting events), a number of bubbles can be considered as noise, e.g., Literature, Politics.
In fact, Sport was the dominating event type during the considered period (2014 FIFA World Cup), therefore it is reasonable that sport-related clusters are those accumulating the highest number of system errors.