RAMBLE ON: Tracing Movements of Popular Historical Figures

We present RAMBLE ON, an application integrating a pipeline for frame-based information extraction and an interface to track and display movement trajectories. The code of the extraction pipeline and a navigator are freely available; moreover, a demonstrator displays the outcome of a case study on the trajectories of notable persons of the XX Century.


Introduction
At a time when there were no social media, email or mobile phones, interactions were strongly shaped by movements across cities and countries. In particular, the movements of eminent figures of the past were the engine of important changes in different domains such as politics, science, and the arts. Therefore, tracing these movements provides important data for the analysis of culture and society, fostering so-called cultural analytics (Piper, 2016). This paper presents RAMBLE ON, a novel application that embeds Natural Language Processing (NLP) modules to extract movements from unstructured texts and an interface to interactively explore motion trajectories. In our use case, we focus on biographies of famous historical figures from the first half of the XX Century extracted from the English Wikipedia. A web-based navigator related to this use case (available at http://dhlab.fbk.eu/rambleon/) is meant for scholars without a technical background, supporting them in discovering new cultural migration patterns with respect to different time periods, geographical areas and domains of occupation. We also release the script to generate trajectories and a stand-alone version of the RAMBLE ON navigator, where users can upload their own set of movements taken from Wikipedia biographies.

Related Work
The analysis of human mobility is an important topic in many research fields such as social sciences and history, where structured data taken from census records, parish registers, mobile phones etc. are employed to quantify travel flows and find recurring patterns of movements (Pooley and Turnbull, 2005; Gonzalez et al., 2008; Cattuto et al., 2010; Jurdak et al., 2015). Other studies on mobility rely on a large amount of manually extracted information (Murray, 2013) or on shallow extraction methods. For example, Gergaud et al. (2016) detect movements in Wikipedia biographies by assuming that cities linked in biography pages are locations where the subject lived or spent some time.
However, we believe that, even though the contribution of NLP has been largely neglected in cultural analytics studies, language technologies can greatly support this kind of research. For this reason, in RAMBLE ON we combine state-of-the-art Information Extraction and semantic processing tools and display the extracted information through an advanced interactive interface. With respect to previous work, our application makes it possible to extract a wide variety of movements, going beyond the birth-to-death migration that is the focus of Schich et al. (2014) or the transfers of deportees to concentration camps during Nazism as in Russo et al. (2015).

Information Extraction
In Figure 1, we show the general NLP workflow behind information extraction in RAMBLE ON. The goal is to obtain, starting from an unstructured text, a set of destinations together with their coordinates and a date, each representative of the place where a person moved or lived at the given timepoint.
Input Data. In our approach, information extraction is performed on Wikipedia biographical pages. In the first step, these pages are cleaned up by removing infoboxes, tables and tags, keeping only the main body as raw text.
Pre-processing. Raw text is processed using PIKES (Corcoglioniti et al., 2015), a suite of tools for extracting frame-oriented knowledge from English texts. PIKES integrates Semafor (Das et al., 2014), a system for Semantic Role Labeling based on FrameNet (Baker et al., 1998). Semafor's output is used to identify predicates related to movements and their arguments, because FrameNet's high-level organization in semantic frames is a useful way to generalize over predicates. PIKES also includes Stanford CoreNLP (Manning et al., 2014). Its modules for Named Entity Recognition and Classification (NERC), coreference resolution and recognition of time expressions are used to detect, for each text: (i) mentions related to the person who is the subject of the biography; (ii) locations and organizations that can be movement destinations; (iii) dates.
Frame Selection. Starting from the frames related to the Motion frames in FrameNet and a manual analysis of a set of biographies, we identified 45 candidate frames related to direct (e.g. Departing) or indirect (e.g. Residence) movements of people. After a manual evaluation of these 45 frames on a set of biographies annotated with PIKES, we removed 16 of them from the list of candidate frames because of their high number of false positives. These include, for example, Escaping, Getting underway and Touring. Combining the information from the CoreNLP modules in PIKES with the remaining 29 frames listed in Table 1, our application extracts a list of candidate sentences, each containing a date and a movement of the subject together with a destination. Each candidate sentence thus represents the geographical position of a person at a certain time.
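The selection criterion above can be sketched as a simple filter. This is a minimal illustration, not the released RAMBLE ON script: the `Sentence` record and its field names are hypothetical, and `MOVEMENT_FRAMES` holds only two frame names mentioned in the text as placeholders for the 29 frames of Table 1.

```python
from dataclasses import dataclass
from typing import List, Optional, Set

# Placeholder subset (assumption): the real list is the 29 frames in Table 1.
MOVEMENT_FRAMES: Set[str] = {"Departing", "Residence"}


@dataclass
class Sentence:
    """Hypothetical per-sentence summary of the PIKES annotations."""
    text: str
    frames: Set[str]             # names of the frames evoked in the sentence
    subject_is_mover: bool       # mover corefers with the biography subject
    destination: Optional[str]   # location or organization found by NERC
    date: Optional[str]          # normalized time expression, if any


def is_candidate(sentence: Sentence, frames: Set[str] = MOVEMENT_FRAMES) -> bool:
    """Keep a sentence only if it has a movement frame, the subject, a destination and a date."""
    return (
        bool(sentence.frames & frames)
        and sentence.subject_is_mover
        and sentence.destination is not None
        and sentence.date is not None
    )


def select_candidates(sentences: List[Sentence]) -> List[Sentence]:
    """Return the candidate sentences in their original order."""
    return [s for s in sentences if is_candidate(s)]
```

Requiring all four conditions at once mirrors the description in the text: a sentence with a movement frame but no date or no destination cannot be placed on the map and is therefore dropped.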
Georeferencing. To georeference all the destinations mentioned in the candidate sentences, RAMBLE ON uses Nominatim. Due to errors by the NERC module (e.g., Artaman League annotated as a geographical entity), some destinations can lack coordinates and are thus discarded. Moreover, for each biography, the places and dates of birth and death of the subject are added as taken from DBpedia.
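As an illustration of this step, the sketch below queries the public Nominatim search endpoint and discards destinations for which no coordinates are returned. It is an assumption that the public endpoint is used; the paper does not say whether RAMBLE ON calls the public service or a local instance. The function names and the User-Agent string are hypothetical.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, Dict, Iterable, Optional, Tuple

Coords = Tuple[float, float]  # (latitude, longitude)

# Public Nominatim search endpoint (assumption: RAMBLE ON may use another instance).
NOMINATIM_SEARCH = "https://nominatim.openstreetmap.org/search"


def parse_nominatim(payload: str) -> Optional[Coords]:
    """Extract the coordinates of the first hit from a Nominatim JSON response."""
    results = json.loads(payload)
    if not results:
        return None  # no match, e.g. an organization mislabelled as a place
    top = results[0]
    return float(top["lat"]), float(top["lon"])


def geocode(name: str) -> Optional[Coords]:
    """Look up a place name; performs a network request."""
    query = urllib.parse.urlencode({"q": name, "format": "json", "limit": 1})
    request = urllib.request.Request(
        f"{NOMINATIM_SEARCH}?{query}",
        headers={"User-Agent": "rambleon-sketch/0.1"},  # hypothetical identifier
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return parse_nominatim(response.read().decode("utf-8"))


def georeference(
    names: Iterable[str],
    lookup: Callable[[str], Optional[Coords]] = geocode,
) -> Dict[str, Coords]:
    """Map each destination to coordinates, dropping those that cannot be located."""
    located: Dict[str, Coords] = {}
    for name in names:
        coords = lookup(name)
        if coords is not None:
            located[name] = coords
    return located
```

Passing the lookup function as a parameter keeps the discard logic testable without network access and makes it easy to swap in a cached or locally hosted geocoder.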

Ramble On Navigator
Movements extracted with the procedure described in Section 3 are graphically presented on an interactive map that visualizes trajectories between places. The interface, called RAMBLE ON Navigator, is built using technology based on web standards (HTML5, CSS3, JavaScript) and open source libraries for data visualization and geographical representation, i.e. d3.js and Leaflet. Through this interface (see Figure 2), it is possible to filter the movements on the basis of the time span or to search for a specific individual. Moreover, if information about nationality and domain of occupation is provided in the JSON files, the Navigator allows users to further filter the search. When the mouse hovers over a trajectory, the snippet of text from which it was automatically extracted appears on the bottom left. Information about all the movements related to a place is displayed when hovering over a spot on the map. The trajectories have an arrow indicating the route destination and are dashed if the movement described by the snippet started before the selected time span. The online version of the Navigator shows the output of the case study presented in Section 5, while the stand-alone application also allows users to upload another set of data.
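The time-span filter and the dashed-line rule can be sketched as follows. This is one possible reading of the behaviour described above, written in Python for illustration rather than in the Navigator's own JavaScript: the `Trajectory` record, its field names, and the choice to show a trajectory when its arrival year falls in the selected span are all assumptions.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Trajectory:
    """Hypothetical record for one movement between two places."""
    person: str
    origin: str
    destination: str
    start_year: int  # year associated with the origin
    end_year: int    # year associated with the destination


def visible_in_span(t: Trajectory, first: int, last: int) -> bool:
    """Assumption: a trajectory is shown if it ends inside the selected span."""
    return first <= t.end_year <= last


def line_style(t: Trajectory, first: int) -> str:
    """Dashed when the movement started before the selected span, solid otherwise."""
    return "dashed" if t.start_year < first else "solid"


def filter_trajectories(
    trajectories: List[Trajectory], first: int, last: int
) -> List[Tuple[Trajectory, str]]:
    """Return the visible trajectories paired with their line style."""
    return [
        (t, line_style(t, first))
        for t in trajectories
        if visible_in_span(t, first, last)
    ]
```

For a selected span of 1920 to 1925, a movement that began in 1915 and ended in 1921 would be drawn dashed, while one that both began and ended within the span would be drawn solid.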

Case Study
We relied on the Pantheon dataset (Yu et al., 2016) to identify a list of notable figures to be used in our case study. We chose Pantheon since it provides a ready-to-use set of people already classified into categories based on their domain of occupation (e.g., Arts, Sports), birth year, nationality and gender. More specifically, we considered 2,407 individuals from Europe and North America living between 1900 and 1955. First, we downloaded the corresponding Wikipedia pages, as published in April 2016, collecting a corpus of more than 7.5 million words. Then we applied the workflow described in Section 3 and enriched the output data with the categories taken from Pantheon. We manually refined the output by removing the sentences wrongly identified as movements (14.02%), for example those not referring to the subject of the biography (e.g., When communist North Korea invaded South Korea in 1950, he sent in U.S. troops). The final dataset consists of 2,929 sentences from 1,283 biographies, since 1,124 individuals had no associated movements. This may be due either to an actual lack of sentences concerning movements or to errors in the automatic processing, e.g., missed identification of places or dates.
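The counts reported above can be cross-checked with a few lines of arithmetic. The coverage and the average number of sentences per biography computed below are derived here from the reported figures; they are not values stated in the paper.

```python
# Figures reported in the case study.
individuals = 2407        # notable figures considered
with_movements = 1283     # biographies with at least one extracted movement
without_movements = 1124  # individuals with no associated movement
sentences = 2929          # movement sentences after manual refinement

# The two groups should add up to the full set of individuals.
assert with_movements + without_movements == individuals

# Derived quantities (not reported in the paper).
coverage = with_movements / individuals
sentences_per_biography = sentences / with_movements

print(f"Biographies with movements: {coverage:.1%}")
print(f"Sentences per such biography: {sentences_per_biography:.2f}")
```

Under these figures, slightly more than half of the individuals have at least one extracted movement, with a little over two movement sentences each on average.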

Conclusion and Future Work
We presented an automatic approach for the extraction and visualisation of motion trajectories that is easy to extend to different datasets and can provide insights for studies in many fields, e.g., history and sociology.
In the future, we will mainly focus on improving the system coverage. Currently, missing trajectories are mainly due to (i) the presence of predicates not recognized as lexical units in FrameNet, e.g. exile; (ii) the lack of information in the English Wikipedia biography; and (iii) the presence of sentences with complex temporal structures, e.g., Cummings returned to Paris in 1921 and remained there for two years before returning to New York. These issues can be addressed by adding missing predicates to FrameNet, extending PIKES to other languages, and experimenting with different systems for temporal information processing (Llorens et al., 2010). We also plan to apply the methodology presented in Aprosio and Tonelli (2015) to automatically recognize the Wikipedia text passages dealing with biographical information, so as to discard sections containing irrelevant information.