Learning Prototypical Event Structure from Photo Albums

Activities and events in our lives are structural, be it a vacation, a camping trip, or a wedding. While individual details vary, there are characteristic patterns that are specific to each of these scenarios. For example, a wedding typically consists of a sequence of events such as walking down the aisle, exchanging vows, and dancing. In this paper, we present a data-driven approach to learning event knowledge from a large collection of photo albums. We formulate the task as constrained optimization to induce the prototypical temporal structure of an event, integrating both visual and textual cues. Comprehensive evaluation demonstrates that it is possible to learn multimodal knowledge of event structure from noisy web content.


Introduction
Many common scenarios in our lives, such as a wedding or a camping trip, show characteristic structural patterns. As illustrated in Figure 1, these patterns can be sequential, such as in a wedding, where exchanging vows generally happens before cutting the cake. In other scenarios, there may be a set of composing events, but no prominent temporal relations. A camping trip, for example, might include events such as hiking, which can happen either before or after setting up a tent.
This observation on the prototypical patterns in everyday scenarios goes back to early artificial intelligence research. Scripts (Schank and Abelson, 1975), an early formalism, were developed to encode the necessary background knowledge to support an inference engine for common sense reasoning in limited domains. However, early approaches based on hand-coded symbolic representations proved to be brittle and difficult to scale. An alternative direction in recent years has been statistical knowledge induction, i.e., learning script or common sense knowledge bottom-up from large-scale data. While most prior work is based on text (Pichotta and Mooney, 2014; Jans et al., 2012; Chambers and Jurafsky, 2008; Chambers, 2013), recent work has begun exploring the use of images as well (Bagherinezhad et al., 2016; Vedantam et al., 2015).

Figure 1: We collect photo albums of scenarios (e.g., weddings) and cluster their images and captions to learn the hierarchical events that make up these scenarios. We use constrained optimization to decode the temporal order of these events, and we extract the prototypical descriptions that define them.
In this paper, we present the first study for learning knowledge about common life scenarios (e.g., weddings, camping trips) from a large collection of online photo albums with time-stamped images and their captions. The resulting dataset includes 34,818 time-stamped photo albums corresponding to 12 distinct event scenarios with 1.5 million images and captions (see Table 1 for more details).
We cast unsupervised learning of event structure as a sequential multimodal clustering problem, which requires solving two subproblems concurrently: identifying the boundaries of events and assigning identities to each of these events. We formulate this process as constrained optimization, where constraints encode the temporal event patterns that are induced directly from the data. The outcome is a statistically induced prototypical structure of events characterized by their visual and textual representations.
We evaluate the quality and utility of the learned knowledge on three tasks: temporal event ordering, segmentation prediction, and multimodal summarization. Our experimental results demonstrate the model's ability to predict the order of photos in albums, to partition photo albums into event sequences, and to summarize albums.

Overview
The high-level goal of this work is unsupervised induction of the prototypical event structure of common scenarios from multimodal data. We assume a two-level structure: high-level events, which we refer to as scenarios (e.g., wedding, funeral), are given, and low-level events (e.g., dance, kiss, vows), which we refer to as events, are to be automatically induced. In this section, we provide an overview of our approach (Section 2.1) and introduce our new dataset (Section 2.2).

Approach
Given a large collection of photo albums corresponding to a scenario, we want to learn three aspects of event knowledge by (1) identifying events common to the given scenario (Section 4.1), (2) learning temporal relations across events (Section 4.2), and (3) extracting prototypical captions for each event (Section 4.3).
To induce the prototypical event structure, an important subproblem we consider is individual photo album analysis, where the task is (1) partitioning each photo album into a sequence of segments, and (2) assigning the event identity to each segment. We present an inference model based on Integer Linear Programming (ILP) in Section 3 to perform both segmentation and event identification simultaneously, in consideration of the learned knowledge that we describe in Section 4.

Dataset
For this study, we have compiled a new corpus of multimodal photo albums across 12 distinct scenarios. It comprises 34,818 albums containing 1.5 million pairs of online photographs and their textual descriptions. Table 1 shows the list of scenarios and the corresponding data statistics. We choose six scenarios (the top half of Table 1) that we expect to have an inherent temporal event structure that can be learned, and six that we expect not to (the bottom half of Table 1). The dataset is collected using the Flickr API. We use the scenario names and variations of them (e.g., Paris Trip, Paris Vacation) as queries for images. We then form albums from these images by grouping images by user, sorting them by timestamp, and extracting groups that fall within a contained time frame (e.g., 24 hours for a wedding, 5 days for a trip). For all images, we extract the first sentences of the corresponding textual descriptions as captions and also store their timestamps. This data is publicly available at https://www.cs.washington.edu/projects/nlp/protoevents.
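As a rough sketch of the album-forming step above, the following groups time-stamped photos by user and keeps groups that fit in a contained time frame. The record format is a simplifying assumption, and the actual pipeline may further split a user's photos into separate sessions:

```python
from itertools import groupby

def form_albums(photos, max_span_hours):
    """Group time-stamped photo records into albums: one user's photos,
    sorted by timestamp, kept only if the whole group fits within a
    contained time frame (e.g., 24 hours for a wedding)."""
    albums = []
    # groupby requires the input to be sorted by the grouping key.
    photos = sorted(photos, key=lambda p: (p["user"], p["ts"]))
    for _, group in groupby(photos, key=lambda p: p["user"]):
        group = list(group)
        if group[-1]["ts"] - group[0]["ts"] <= max_span_hours * 3600:
            albums.append(group)
    return albums
```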

Inference Model for Multimodal Event Segmentation and Identification
Given a photo album, the goal of the inference is to assign events to photos and to segment albums by event. More formally, given a sequence of M photos P = {p_1, ..., p_M} and N learned events E = {e_1, ..., e_N}, the task is to assign each photo to a single event. The event assignment can be viewed as a latent variable for each photo. We formulate a constrained optimization (depicted in Figure 2) that maximizes the objective function F, which consists of three scoring components: (a) event assignment scores φ_event (Section 3.1), (b) segmentation scores φ_seg (Section 3.2), and (c) temporal knowledge scores φ_temporal (Section 3.3):

F = φ_event + φ_seg + φ_temporal (1)

Decision Variables. The binary decision variable X_{i,k} indicates that photo p_i is assigned to event e_k. The binary decision variable Z_{i,j,k,l} indicates that photos p_i and p_j are assigned to events e_k and e_l, respectively:

Z_{i,j,k,l} = X_{i,k} · X_{j,l} (2)

Figure 2: The events learned in Section 4 are assigned to photos based on textual (A^c) and visual (A^v) affinities, which encode how well a photo represents an event (φ_event). Segmentation scores (φ_seg) between adjacent photos encourage similar photos to be assigned the same event. Local transition, P_L, and global pairwise ordering, P_G, probabilities encode the learned temporal knowledge between events. φ_temporal encourages event assignments toward a learned temporal structure of the scenario.
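The relationship between the two variable families can be illustrated concretely: in any feasible solution, each pairwise variable Z must equal the product of the corresponding assignment variables X (the ILP enforces this with linear constraints rather than the product itself). A minimal sketch:

```python
import itertools

def consistent_Z(X, M, N):
    """Derive pairwise variables Z[i, j, k, l] from photo-event
    assignments X[i][k]: Z is 1 exactly when photo i takes event k
    and photo j takes event l. The ILP linearizes this product with
    inequality constraints instead of computing it directly."""
    Z = {}
    for i, j in itertools.combinations(range(M), 2):
        for k in range(N):
            for l in range(N):
                Z[i, j, k, l] = X[i][k] * X[j][l]
    return Z
```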

Event Assignment Scores
Event assignment scores quantify the textual and visual affinity between a photo p_i and an event e_k. Affinities are measures of representation similarity between photos and events. These scores push photos displaying a certain event to be assigned to that event. For now we assume the textual affinity matrix A^c ∈ [0, 1]^{M×N} and the visual affinity matrix A^v ∈ [0, 1]^{M×N} are given. We describe how we obtain these affinity matrices in Section 4.1. Event assignment scores are defined as the weighted sum of both textual and visual affinities:

φ_event = Σ_{i=1}^{M} Σ_{k=1}^{N} X_{i,k} (γ_ce A^c_{i,k} + γ_ve A^v_{i,k}) (3)

where X_{i,k} is a photo-event assignment decision variable, and γ_ce and γ_ve are hyperparameters that balance the contribution of the two affinities.
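A minimal sketch of computing the φ_event component from given affinity matrices, with plain Python lists standing in for the M×N matrices:

```python
def event_assignment_score(X, A_c, A_v, gamma_ce=1.0, gamma_ve=1.0):
    """phi_event: sum over all photo-event pairs of the binary
    assignment indicator X[i][k] times the weighted textual (A_c)
    and visual (A_v) affinities."""
    return sum(X[i][k] * (gamma_ce * A_c[i][k] + gamma_ve * A_v[i][k])
               for i in range(len(X)) for k in range(len(X[0])))
```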

Segmentation Scores
Segmentation scores quantify textual and visual similarities between adjacent photos. These scores encourage similar adjacent photos to be assigned to the same event. We define the segmentation score as the weighted sum of the textual (b^c) and visual (b^v) similarities of adjacent photos assigned to the same event:

φ_seg = Σ_{i=1}^{M-1} Σ_{k=1}^{N} Z_{i,i+1,k,k} (γ_cs b^c_i + γ_vs b^v_i) (4)

where b^c, b^v ∈ [0, 1]^{M-1} are vectors of textual and visual similarity scores between adjacent photos whose i-th element corresponds to the similarity score between photos p_i and p_{i+1}, Z is a decision variable defined by Equation 2, and γ_cs and γ_vs are hyperparameters balancing the contribution of both types of similarity. The similarity scores in the b vectors are computed as the cosine similarity of the feature representations of adjacent images in both the textual and visual modalities.
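A sketch of the segmentation component evaluated on a fixed assignment, with cosine similarity used to fill the b vectors (feature extraction itself is assumed to happen upstream):

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = (math.sqrt(sum(a * a for a in u)) *
           math.sqrt(sum(b * b for b in v)))
    return num / den if den else 0.0

def segmentation_score(assign, b_c, b_v, gamma_cs=0.5, gamma_vs=0.15):
    """phi_seg on a fixed assignment: reward adjacent photos given
    the SAME event, weighted by their textual (b_c) and visual (b_v)
    similarities. assign[i] is the event index of photo i."""
    return sum(gamma_cs * b_c[i] + gamma_vs * b_v[i]
               for i in range(len(assign) - 1)
               if assign[i] == assign[i + 1])
```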

Temporal Knowledge Scores
Temporal knowledge scores quantify the compatibilities across different event assignments in terms of their relative ordering. For now, we assume two types of temporal knowledge matrices are given: L ∈ [0, 1]^{N×N}, which stores local transition probabilities for every pair of events e_k and e_l, and G ∈ [0, 1]^{N×N}, which stores global pairwise ordering probabilities for every pair of events e_k and e_l. We describe how we obtain these temporal knowledge matrices in Section 4.2. The temporal knowledge score, defined below, encourages the inference model to assign events that are compatible with the learned temporal knowledge:

φ_temporal = γ_lp Σ_{i=1}^{M-1} Σ_{k,l} Z_{i,i+1,k,l} L_{k,l} + γ_gp Σ_{i<j} Σ_{k,l} Z_{i,j,k,l} G_{k,l} (5)

where Z is a decision variable defined by Equation 2, and γ_lp and γ_gp are hyperparameters that balance the contribution of local and global temporal knowledge in the objective.
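Evaluated on a fixed assignment (rather than over decision variables), the temporal component reduces to summing L over adjacent event pairs and G over all ordered event pairs, roughly:

```python
def temporal_score(assign, L, G, gamma_lp=1.0, gamma_gp=1.0):
    """phi_temporal on a fixed assignment: local term over adjacent
    photo pairs (L holds transition probabilities) plus global term
    over ALL ordered photo pairs (G holds ordering probabilities)."""
    M = len(assign)
    local = sum(L[assign[i]][assign[i + 1]] for i in range(M - 1))
    glob = sum(G[assign[i]][assign[j]]
               for i in range(M) for j in range(i + 1, M))
    return gamma_lp * local + gamma_gp * glob
```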

Constraints
We include hard constraints that force each photo to be assigned to exactly one event:

Σ_{k=1}^{N} X_{i,k} = 1, for all i ∈ {1, ..., M}

The number of these constraints is linear in the number of photos in an album. We also include hard constraints to ensure consistency between the binary decision variables X and Z:

Z_{i,j,k,l} ≤ X_{i,k}, Z_{i,j,k,l} ≤ X_{j,l}

which state that Z_{i,j,k,l} can be 1 only if both X_{i,k} and X_{j,l} are 1. The number of constraints for segmentation scores and local transition probabilities is O(MN^2), because they model interactions between adjacent photos for all event pairs. The number of constraints for global pairwise ordering probabilities is O(M^2 N^2), because they model interactions between all pairs of photos in an album for all event pairs.
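For intuition, the whole inference can be emulated on toy inputs by brute force: enumerate every assignment satisfying the exactly-one constraint and score it with the three objective components. This is only a sketch (exponential in M, whereas the ILP scales to real albums), and the single combined affinity matrix A and the uniform same-event segmentation bonus are simplifications of the model's separate textual/visual terms:

```python
import itertools

def best_assignment(A, L, G, gamma=(1.0, 0.5, 1.0, 1.0)):
    """Brute-force analogue of the ILP on toy inputs: enumerate every
    photo-to-event assignment (exactly-one is implicit in the encoding),
    score it with event, segmentation, and temporal terms, return the
    argmax. A[i][k] is a combined affinity of photo i for event k."""
    g_e, g_s, g_l, g_g = gamma
    M, N = len(A), len(A[0])

    def score(assign):
        ev = sum(A[i][assign[i]] for i in range(M))            # event term
        seg = sum(1.0 for i in range(M - 1)                    # same-event bonus
                  if assign[i] == assign[i + 1])
        loc = sum(L[assign[i]][assign[i + 1]]                  # local temporal
                  for i in range(M - 1))
        glo = sum(G[assign[i]][assign[j]]                      # global temporal
                  for i in range(M) for j in range(i + 1, M))
        return g_e * ev + g_s * seg + g_l * loc + g_g * glo

    return max(itertools.product(range(N), repeat=M), key=score)
```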

Learned Event Knowledge
We learn base events for each scenario by clustering photos from training albums related to that scenario (Figure 3). As described in Section 3, these base events and their temporal knowledge are incorporated in a joint model for event induction in unseen albums.

Learned Event Representation
We perform k-means clustering over captions to create a base event model. We perform text-only clustering at first since visual cues are significantly noisier. Because not all photos have informative captions, we expect this base clustering to form meaningful clusters over only a subset of the data. For each scenario, the largest cluster corresponds to a "miscellaneous" cluster, as its captions tend to be relatively uninformative about specific events. This cluster is excluded when computing temporal knowledge probabilities (Section 4.2).

Figure 3: Photos are clustered by their captions. We can compute the visual, e^v_k, and caption, e^c_k, centers for all the clusters, as well as the local transition, P_L, and global pairwise ordering, P_G, probabilities between these events based on the sequential patterns they exhibit in the training set.
The visual and textual representations of an event are computed as the average of the visual and textual features, respectively, of the photos assigned to that event. We compute each textual affinity A^c_{i,k} in the event assignment scores (Equation 3) as the cosine similarity between the textual features of the caption of photo p_i and the textual representation of event e_k. For textual features, we extract noun and verb unigrams using TurboTagger (Martins et al., 2013) and weigh them by their discriminativeness relative to their scenario, P(S|w). Given scenario S and word w, P(S|w) is defined as the number of albums for the scenario in which the word occurs divided by the total number of albums in that scenario. The visual affinity A^v_{i,k} is the similarity between the visual features of photo p_i and the visual representation of event e_k. For visual features, we use the convolutional features from the final layer activations of the 16-layer VGGNet model (Simonyan and Zisserman, 2015).
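The album-frequency weighting described above can be sketched as follows, assuming captions have already been reduced to noun/verb unigram lists per album (tagging with TurboTagger is assumed upstream):

```python
from collections import Counter

def word_weights(albums):
    """Weight each unigram by the fraction of the scenario's albums
    it appears in, mirroring the P(S|w)-style discriminativeness
    weight described in the text. `albums` is a list of word lists,
    one per album in the scenario."""
    df = Counter()
    for album_words in albums:
        for w in set(album_words):  # count each word once per album
            df[w] += 1
    n = len(albums)
    return {w: c / n for w, c in df.items()}
```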

Temporal Knowledge
Local transition probabilities. These probabilities, denoted P_L, encode an expected sequence of events using temporal patterns among adjacent photos in training albums:

P_L(e_k → e_l) = C(e_k → e_l) / Σ_{l'} C(e_k → e_{l'})

where C(e_k → e_l) is the observed count of that specific event transition. This is the likelihood that event e_k is immediately followed by event e_l.
Global pairwise ordering probabilities. These probabilities, denoted P_G, encode global structural patterns about events. We model P_G for each pair of events as a binomial distribution by computing the likelihood that one event occurs before another at any point in an album:

P_G(e_k ⇒ e_l) = C(e_k ⇒ e_l) / (C(e_k ⇒ e_l) + C(e_l ⇒ e_k))

where C(e_k ⇒ e_l) is the observed count of e_k occurring anytime before e_l across all photo albums. These global probabilities model relations among events assigned to all photos in the album, not just events assigned to adjacent photos. This distinction is important because these probabilities can encode global patterns between events and are not limited to modeling a sequential event chain. We use these learned temporal probabilities, P_L and P_G, in the matrices L and G from φ_temporal (Equation 5). These matrices index the local transition and global pairwise ordering probabilities for pairs of events when computing temporal knowledge scores in the inference model (Section 3.3).
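Both kinds of temporal probabilities can be estimated from training event sequences by simple counting, along the lines of:

```python
from collections import defaultdict

def temporal_probs(event_seqs, n_events):
    """Estimate local transition probabilities P_L (adjacent pairs)
    and global pairwise ordering probabilities P_G (any i < j pair)
    from a list of training event sequences."""
    trans = defaultdict(float)   # counts of immediate transitions
    before = defaultdict(float)  # counts of "anytime before" orderings
    for seq in event_seqs:
        for a, b in zip(seq, seq[1:]):
            trans[a, b] += 1
        for i in range(len(seq)):
            for j in range(i + 1, len(seq)):
                before[seq[i], seq[j]] += 1
    P_L, P_G = {}, {}
    for k in range(n_events):
        total = sum(trans[k, l] for l in range(n_events))
        for l in range(n_events):
            P_L[k, l] = trans[k, l] / total if total else 0.0
            denom = before[k, l] + before[l, k]
            P_G[k, l] = before[k, l] / denom if denom else 0.0
    return P_L, P_G
```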

Prototypical Captions
After clustering the photos, the representative language of the captions in each cluster begins to tell a story about each scenario. The event names are automatically extracted using the most common content words among captions in the cluster. For each cluster, we also compile prototypical captions by extracting captions whose lemmatized forms are frequently observed throughout multiple albums in the scenario. Sample events and their prototypical captions from three scenarios are displayed in Table 2.

Experimental Setup
Data split. For scenarios with more than 1000 albums, we use 100 albums for each of the development and test sets and use the rest for training. For scenarios with less than 1000 albums, we use 50 albums for each of the development and test sets, and the rest for training.
Implementation details. We optimize our objective function using integer linear programming (Roth and Yih, 2004) with the Gurobi solver (Gurobi Optimization, Inc., 2015). For computational efficiency, temporally close sets of consecutive photos are treated as one unit during the optimization. These units reduce the number of variables and constraints in the model from a function of the number of photos to a function of the number of units. We form units heuristically by agglomeratively merging images whose timestamps are within a certain range of the closest image in a unit. When merging photos, the textual affinity of a unit for a particular event is the maximum affinity for that event among photos in the unit, and the visual affinity of a unit is the average of all affinities for that event among photos in the unit. The textual and visual similarities of consecutive units are defined in terms of the similarities between the two photos at the units' boundary. Temporal information for events not well aligned with a particular unit should not influence the objective, so we include temporal scores only for unit-event pairs whose textual and visual event assignment scores are both greater than 0.05.
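The unit-forming heuristic can be sketched as follows, approximating "closest image in a unit" by the immediately preceding photo (equivalent for time-sorted albums, where the previous photo is the closest member of the growing unit):

```python
def merge_units(timestamps, gap):
    """Agglomeratively merge consecutive photos into units whenever a
    photo's timestamp is within `gap` seconds of the previous photo;
    otherwise start a new unit. Returns lists of photo indices."""
    units = [[0]]
    for i in range(1, len(timestamps)):
        if timestamps[i] - timestamps[i - 1] <= gap:
            units[-1].append(i)
        else:
            units.append([i])
    return units
```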
Hyperparameters. We tune the hyperparameters using grid search on the development set. In models where the corresponding objective components are included, we set γ ce = 1, γ ve = 1, γ cs = .5, γ vs = .15, γ lp = 1, and γ gp = 4 Q (where Q is the

Experimental Results
We evaluate the performance of our model on three tasks. The first task evaluates the effect of learned temporal knowledge in predicting the correct order of photos in an unseen album (Section 6.1). The second task evaluates the model's ability to segment albums into logical groupings (Section 6.2). The third task evaluates the quality of prototypical captions and their use in photo album summarization (Section 6.3).

Temporal Ordering of Photos
We evaluate the model's ability to capture the temporal relationships between events in a scenario. Given two randomly selected photos p_i and p_j from an album, the task is to predict which of the photos appears earlier in the album using their event assignments. We compare the full model that assigns events to photos using ILP (Section 3) with two baselines: k-MEANS, which assigns events to photos using k-means clustering over captions (Section 4), and NO TEMPORAL, a variant of the full model that does not use temporal knowledge scores (φ_temporal in Equation 1) during optimization. We run each method over a test photo album, yielding events e_k and e_l assigned to the photos p_i and p_j, respectively. We then use the learned global pairwise ordering probabilities (Section 4.2) to predict which photo appears earlier in the album. We report the accuracy of each method in predicting the order of photos compared to the actual order in the albums. We repeat this experiment 50 times for each album and average the number of correct choices across every album and every trial. Table 3 reports the results of the full model compared to the baselines. The results show that temporal knowledge generally helps in predicting photo ordering. We observe that the full model achieves higher scores for scenarios that we expect to have a sequential structure (e.g., WEDDING, BABY BIRTH, MARATHON). Conversely, the full model achieves lower overall scores in non-sequential scenarios (e.g., PARIS TRIP, NEW YORK TRIP). Qualitatively, we notice interesting temporal patterns, such as the fact that during a marathon the starting line occurs before the medal awards with 92.3% probability, or that Parisian tourists have a 24% chance (∼10× higher than random chance) of visiting the Eiffel Tower immediately after the Arc de Triomphe (a high local transition probability that correctly implies their real-world proximity).
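The ordering evaluation can be sketched as below; `assign_fn` stands in for any of the compared assignment methods, albums are given in their true photo order, and pairs with no ordering preference (P_G = 0.5) are simply skipped here rather than scored at chance:

```python
def ordering_accuracy(albums, assign_fn, P_G):
    """For every photo pair (i, j) with i before j in the true album
    order, predict i-first iff the global ordering probability of
    their assigned events exceeds 0.5; report prediction accuracy."""
    correct = total = 0
    for album in albums:
        events = assign_fn(album)
        for i in range(len(album)):
            for j in range(i + 1, len(album)):
                p = P_G.get((events[i], events[j]), 0.5)
                if p == 0.5:
                    continue  # no preference between these events
                total += 1
                if p > 0.5:
                    correct += 1
    return correct / total if total else 0.0
```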

Album Segmentation
Our model partitions photos in albums into coherent events. The album segmentation evaluation tests if the model recovers the same sequences of photos that a human would identify in a photo album as events.
Evaluation. We had an impartial annotator label where they thought events began and ended in 10 candidate albums of more than 100 photos each for three scenarios: WEDDING, FUNERAL, and CAMPING. We evaluate how well our model can replicate these boundaries with two metrics. The first metric is the F_1 score of recovering the same boundaries annotated by humans. The second metric is d, the difference between the number of events segmented by the model and the number in the annotated albums. We report results for exact event boundaries as well as relaxed boundaries, where the start of an event can be up to r photos away from the start of an annotated event (r is the relaxation coefficient). For reference, we note that albums in the wedding scenario were dual-annotated, and the agreement between annotators is 56.9% for r = 0 and 77.5% for r = 2.
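The relaxed boundary metric can be sketched as a greedy one-to-one matching within the relaxation window r; the exact matching procedure behind the paper's numbers is an assumption here:

```python
def boundary_f1(pred, gold, r=0):
    """F1 of predicted event boundaries against annotated ones, where
    a predicted boundary matches if it lies within r photos of a
    not-yet-matched gold boundary (greedy one-to-one matching)."""
    gold_left = list(gold)
    tp = 0
    for b in pred:
        match = next((g for g in gold_left if abs(g - b) <= r), None)
        if match is not None:
            gold_left.remove(match)  # each gold boundary matches once
            tp += 1
    p = tp / len(pred) if pred else 0.0
    rec = tp / len(gold) if gold else 0.0
    return 2 * p * rec / (p + rec) if p + rec else 0.0
```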
Results. Table 4 shows a comparison of the full ILP model with the same baselines described above, k-MEANS and NO TEMPORAL. The table shows that the full model generally outperforms the k-MEANS baseline for all three scenarios.
In the WEDDING scenario, the F_1 score for the full model is consistently higher. The k-MEANS baseline oversamples the number of events in albums, which is indicated by an average d significantly greater than 0. For the FUNERAL scenario, the NO TEMPORAL baseline outperforms the full model. We attribute this difference to the smaller data subset (see Table 1), which makes it harder to learn the temporal relations in the scenario and makes the contributions of the local and global temporal probabilities unreliable. In the CAMPING scenario, the F_1 score for the k-MEANS baseline is higher than that of the full model when r = 0. At a high level, CAMPING is a scenario that we expect to have less of a known structure than other scenarios, and it may be harder to segment into its events.

Table 5: Ablation study of objective function components for the wedding scenario. P, R, and F1 are the precision, recall, and F-measure of recovering the same boundaries annotated by humans. d is the average difference between the number of events identified by our models and by the annotators.
Ablation Study. Table 5 shows the performance of ablations of the full model for the wedding scenario. Results show that removing any component of the objective function yields lower recall and F_1 scores than the full model for r = 0. The exception is removing local ordering probabilities, which yields a higher d. These observations support the hypothesis that all components of the objective function contribute to segmenting the album into subsequences of photos depicting the same event. In particular, we note the degradation when removing the global ordering probabilities, indicating that approaches which model only local event transitions, such as hidden Markov models, would not be suitable for this task.

Photo Album Summarization
The final experiment evaluates how our learned prototypical captions can improve downstream tasks such as summarization and captioning.

Summaries
The goal of a good summary is to select the most salient pictures of an album. In our setting, a good summary should have a high coverage of the events in an album and choose the photos that most appropriately depict these events. Given a photo budget b, we choose a subset of photos that aims for these goals. To summarize a test album, we run our model over the entire album. This will yield h unique events assigned to the photos in the album. For each of these h events, we choose the photo with the highest event assignment score for that event (Equation 3) to be in the summary. If h > b, we count the number of photos in the training set assigned to each of the h events and choose the photos corresponding to the b events with the largest membership of photos in the training set. If h < b, we complete the summary with b − h photos from the "miscellaneous" event that are spaced evenly throughout the album. Finally, we replace the caption of each selected photo with a prototypical caption (Section 4.3) for the assigned event.
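The photo-selection rule above can be sketched as follows; for brevity the h < b backfill with evenly spaced "miscellaneous" photos is omitted:

```python
def summarize(events, scores, train_counts, b):
    """Pick one representative photo per induced event (the photo
    with the highest event assignment score); if there are more
    events than the budget b, keep the b events with the most
    training photos. Returns selected photo indices in album order.
    events[i]: event assigned to photo i; scores[i]: its score;
    train_counts: event -> number of training photos."""
    best = {}
    for i, (e, s) in enumerate(zip(events, scores)):
        if e not in best or s > best[e][0]:
            best[e] = (s, i)
    chosen = sorted(best, key=lambda e: -train_counts.get(e, 0))[:b]
    return sorted(best[e][1] for e in chosen)
```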
Baselines. We evaluate against two baselines. The first baseline, KTH, includes every k-th photo of the album in the summary, where k = M/b. The second baseline, k-MEANS, uses the events assigned to photos by k-means clustering and then picks b photos in the same manner as our main model.

Evaluation.
We evaluate the summaries produced by each method with a human evaluation using Amazon Mechanical Turk (AMT). We use albums from the test set that contain more than 40 photos for the wedding scenario. For each album, at random, we present two summaries generated by two algorithms. AMT workers are instructed to choose the better summary considering both the images and the captions. For each comparison of two summaries for an album, we aggregate answers from three workers by majority voting. We set b = 7. The number of assigned events in an album, h, varies by album.
Results. As seen in Table 6, the summary from the full model is preferred 57.7% of the time compared to the KTH baseline. The summaries generated using the full model also perform slightly better than the summaries from k-MEANS. We attribute the superior performance of the full model to the fact that it redistributes photos with noisy captions throughout the events, allowing a larger sample for estimating visual representations of events and yielding more accurate visual affinity measurements when choosing summarization photos. As can be seen from the qualitative examples in Figure 4, the photos chosen and the captions assigned cover key events that would occur during the scenario and describe them in a coherent way. Additional examples are available at https://www.cs.washington.edu/projects/nlp/protoevents.

Prototypical Captions
We also evaluate the quality of the prototypical captions assigned to every photo in the summaries.
For each album, we use the same sets of b photos from the full model in the summarization task and evaluate the quality of the prototypical captions paired with that group of photos.

Evaluation.
We evaluate the quality of the captions assigned to every photo by asking AMT workers to rate the captions on three metrics: grammaticality, relevance to the scenario to which the image belongs, and relevance to its paired image.

Related Work
Previous studies have explored unsupervised induction of salient content structure in newswire texts (Barzilay and Lee, 2004), temporal graph representations (Bramsen et al., 2006), and storyline extraction and event summarization (Xu et al., 2013). Another line of research finds the common event structure in children's stories (McIntyre and Lapata, 2009), where the learned plot structure is used to stochastically generate new stories (Goyal et al., 2010; Goyal et al., 2013). Our work similarly aims to learn the typical temporal patterns and compositional elements that define common scenarios, but with multimodal integration. Compared to studies that learn narrative schemas from natural language (Pichotta and Mooney, 2014; Jans et al., 2012; Chambers and Jurafsky, 2009; Chambers, 2013; Cassidy et al., 2014) or compile script knowledge from crowdsourcing (Regneri et al., 2010), our work explores a new source of knowledge that allows grounded event learning with temporal dimensions, resulting in a new dataset of scenario types that are not naturally accessible from newswire or literature.
While recent studies have explored videos and photo streams as a source for discovering complex events and learning their sequential patterns (Kim and Xing, 2014; Kim and Xing, 2013; Tang et al., 2012; Tschiatschek et al., 2014), their focus was mostly on the visual modality. Zhang et al. (2015) explored multimodal information extraction, focusing specifically on identifying video clips that referred to the same event in television news. This contrasts with the goal of our study, which aims to learn the temporal structure by which common scenarios unfold.
Integrating language and vision has attracted increasing attention in recent years across diverse tasks such as image captioning (Karpathy and Fei-Fei, 2015; Vinyals et al., 2015; Fang et al., 2015; Xu et al., 2015; Chen et al., 2015), cross-modal semantic modeling (Lazaridou et al., 2015), information extraction (Morency et al., 2011; Rosas et al., 2013; Zhang et al., 2015; Izadinia et al., 2015), common-sense knowledge (Vedantam et al., 2015; Bagherinezhad et al., 2016), and visual storytelling (Huang et al., 2016). Our work is similar to both common sense knowledge learning and visual story completion. Our model learns common-sense knowledge about hierarchical and temporal event structure from scenario-specific multimodal photo albums, which can be viewed as visual stories about common life events.
Recent work has focused on photo album summarization using visual and multimodal representations (Sinha et al., 2011). Our work identifies the nature of common events in scenarios and learns their timelines and characteristic forms.

Conclusion
We introduce a novel approach to learning script-like knowledge from photo albums. We model stochastic event structure to learn both the event representations (textual and visual) and the temporal relations among those events. Our event induction method incorporates learned knowledge about events, partitions photo albums into segments, and assigns events to those segments. We demonstrate the utility of the model and its learned knowledge for photo ordering, album segmentation, and summarization. Finally, we provide a dataset depicting 12 scenarios with ∼1.5 million images for future research. Future directions include exploring nuances in the type of temporal knowledge that can be learned across different scenarios.