The BreakingNews Dataset

We present BreakingNews, a novel dataset of approximately 100K news articles including images, text and captions, enriched with heterogeneous meta-data (e.g. GPS coordinates and popularity metrics). The tenuous connection between images and text in news data makes it well suited to taking work at the intersection of Computer Vision and Natural Language Processing to the next step; we therefore hope this dataset will help spur progress in the field.


Introduction
Current successes at the crossroads between NLP and computer vision indicate that the techniques are mature enough for more challenging objectives than those posed by existing datasets. The NLP community has been addressing tasks such as sentiment analysis, popularity prediction, summarization, source identification and geolocation, to name a few, that have been relatively little explored in computer vision. BreakingNews is a large-scale dataset of news articles with rich meta-data (available at http://www.iri.upc.edu/people/aramisa/BreakingNews/index.html) and, we believe, an excellent benchmark for taking joint vision and language developments a step further. In contrast to existing datasets, the link between images and text in BreakingNews is not as direct, i.e., the objects, actions and attributes of the images may not explicitly appear as words in the text (see example in Fig. 1). The visual-language connections are more subtle, and learning them will require the development of new inference tools able to reason at a higher and more abstract level. Furthermore, besides tackling article illustration or image captioning tasks, the proposed dataset is intended to address new challenges, such as source/media agency detection, estimation of GPS coordinates, or popularity prediction (which we annotate based on the reader comments and number of re-tweets).
* denotes equal contribution
In (Ramisa et al., 2016) we present several baseline results for different tasks using this dataset.

Description of the Dataset
The BreakingNews dataset consists of approximately 100,000 articles published between the 1st of January and the 31st of December of 2014. All articles include at least one image and cover a wide variety of topics, including sports, politics, arts, healthcare and local news.
The main text of the articles was downloaded using the IJS newsfeed (Trampuš and Novak, 2012), which provides a clean stream of semantically enriched news articles in multiple languages from a pool of RSS feeds.
We restricted the articles to those that were written in English, contained at least one image, and originated from a shortlist of highly-ranked news media agencies (see Table 1) to ensure a degree of consistency and quality. Given the geographic distribution of the news agencies, most of the dataset is made up of news stories from English-speaking countries in general, and the UK in particular. For each article we downloaded the images, image captions and user comments from the original article webpage. News article images are quite different from those in existing captioned image datasets like Flickr8K (Hodosh et al., 2013) or MS-COCO (Lin et al., 2014): they often include close-up views of a person (46% of the pictures in BreakingNews contain faces) or complex scenes. Furthermore, news image captions use a much richer vocabulary than those in existing datasets (e.g. Flickr8K has a total of 8,918 unique tokens, while eight thousand random captions from BreakingNews already have 28,028), and they rarely describe the exact contents of the picture.
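The vocabulary comparison above rests on counting unique tokens over a set of captions. The following is a minimal sketch of such a count; the tokenization rule (lowercased runs of letters, digits and apostrophes), the function name and the sample captions are assumptions made for illustration, not the procedure used to obtain the figures reported here.

```python
import re


def unique_tokens(captions):
    """Return the set of distinct lowercased word tokens across all captions."""
    vocab = set()
    for caption in captions:
        # Assumption: a token is a run of letters, digits or apostrophes.
        vocab.update(re.findall(r"[a-z0-9']+", caption.lower()))
    return vocab


# Hypothetical captions, for illustration only.
captions = [
    "A dog runs on the beach",
    "The Prime Minister arrives at the summit",
    "A dog jumps over the fence",
]
print(len(unique_tokens(captions)))
```

A different tokenizer (for instance one that keeps case or splits on punctuation differently) would give different absolute counts, so such numbers are only comparable when the same rule is applied to both datasets.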
We complemented the original article images with additional pictures downloaded from Google Images, using the full title of the article as the search query. The five top-ranked images of sufficient size in each search were downloaded as potentially related images (in fact, the original article image usually appears among them).
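The selection of related images described above can be sketched as a filter over ranked search results. This is only an illustration: the `SearchResult` type, the function name and the `min_side` threshold of 200 pixels are hypothetical, since the text does not state what counts as sufficient size, and no real image search API is called.

```python
from typing import NamedTuple


class SearchResult(NamedTuple):
    """Hypothetical record for one image returned by a search, in rank order."""
    url: str
    width: int
    height: int


def select_related_images(results, min_side=200, max_images=5):
    """Keep the first `max_images` results whose shorter side is at least `min_side`."""
    selected = []
    for result in results:  # assumed to be sorted by search rank already
        if min(result.width, result.height) >= min_side:
            selected.append(result)
            if len(selected) == max_images:
                break
    return selected
```

Iterating in rank order and stopping at five keeps the highest-ranked images that pass the size check, rather than the five largest images overall.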
Regarding measures of article popularity, we downloaded all comments in the article page and the number of shares on different social networks (e.g. Twitter, Facebook, LinkedIn) if this information was available. Whenever possible, in addition to the full text of the comments, we recovered the thread structure, as well as the author, publication date, likes (and dislikes) and number of replies. Since no share or comment information was available for "The Irish Independent", we searched Twitter using the full title and collected, in place of comments, the tweets that mentioned a name associated with the newspaper (e.g. @Independent_ie, Irish Independent, @IndoBusiness) or that linked to the original article. We considered the collective number of re-tweets as shares of the article.
Figure 2: Ground truth geolocations of articles.
The IJS Newsfeed annotates the articles with geolocation information both for the news agency and for the article content. This information is primarily taken from the provided RSS summary; when it is not available there, it is inferred from the article using heuristics such as the location of the publisher, TLD country, or the story text. Fig. 2 shows the distribution of news story geolocations. Finally, the dataset is annotated for convenience with shallow and deep linguistic features (e.g. part of speech tags, inferred semantic topics, named entity detection and resolution, sentiment analysis) using the XLike and Enrycher NLP pipelines.
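The geolocation annotation described above follows a fallback pattern: use the RSS summary when it carries a location, otherwise try heuristics. The sketch below illustrates that pattern only. The dictionary keys, the function name and the order in which the heuristics are tried are assumptions; the text lists the heuristics but does not specify their priority or how the IJS Newsfeed implements them.

```python
from typing import Dict, Optional, Tuple

Coordinates = Tuple[float, float]  # (latitude, longitude)

# Hypothetical field names; the RSS summary is tried first, then the heuristics.
# The order of the three heuristic fields is an assumption.
SOURCE_PRIORITY = ("rss_summary", "publisher_location", "tld_country", "story_text")


def resolve_geolocation(
    candidates: Dict[str, Optional[Coordinates]],
) -> Optional[Tuple[str, Coordinates]]:
    """Return (source, coordinates) for the first source that has a location."""
    for source in SOURCE_PRIORITY:
        coords = candidates.get(source)
        if coords is not None:
            return source, coords
    return None  # no source yielded a location
```

Returning the source name together with the coordinates makes it possible to tell locations taken directly from the feed apart from inferred ones.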