While information retrieval systems for text documents have been extensively studied for decades, the landscape has shifted: vast amounts of information today are stored as videos with minimal text metadata. For instance, as of January 2024, YouTube hosts over 14 billion videos. Despite this explosion of multimodal data, there remains a dearth of research on the efficient retrieval, processing, and synthesis of these massive multimodal collections. Existing systems still rely largely on text metadata (e.g., YouTube descriptions), overlooking the rich semantic content embedded within the multimodal data itself.
Individual research groups have independently begun addressing this challenge, leading to parallel yet disconnected efforts to define the research space. We are hosting a collaborative venue to unify these efforts and foster dialogue, which we believe is crucial for advancing the field. Our proposed workshop will focus on two primary areas: (1) the retrieval of multimodal content, spanning text, images, audio, video, and multimodal data (e.g., image-language, video-language); and (2) retrieval-augmented generation, with an emphasis on multimodal retrieval and generation. To further this goal, we will host a shared task on event-based video retrieval and understanding, designed to spark interest and facilitate research development in both retrieval and generation. The task's primary retrieval metric, nDCG@10, will be computed over the final ranked lists of videos produced by participant systems.
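To make the primary metric concrete, the sketch below shows one way nDCG@10 can be computed for a single query in plain Python. The query relevance judgments and video identifiers are hypothetical placeholders, not actual MultiVENT 2.0 data.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k graded relevance scores."""
    # rank is 0-based, so the standard log2(rank + 1) discount becomes log2(rank + 2)
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_video_ids, qrels, k=10):
    """nDCG@k for one query: DCG of the system ranking divided by the ideal DCG."""
    gains = [qrels.get(vid, 0) for vid in ranked_video_ids]
    ideal = sorted(qrels.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical example: graded judgments for one query and one system's ranked list.
qrels = {"video_017": 3, "video_042": 2, "video_101": 1}    # placeholder judgments
run = ["video_042", "video_512", "video_017", "video_101"]  # placeholder ranking
print(f"nDCG@10 = {ndcg_at_k(run, qrels):.4f}")
```

In practice, per-query scores such as this one are averaged over all queries in the test collection to produce a system's leaderboard score.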
MAGMaR Shared Task
Existing news video datasets focus primarily on English news broadcasts. To address this limitation, Sanders et al. introduced MultiVENT, a dataset of multilingual event-centric videos aligned with text documents in five target languages. However, both MultiVENT (2,400 videos) and MSR-VTT (10,000 videos) remain small compared to standard text retrieval collections: HC4, the text corpus used in the 2022 TREC NeuCLIR shared task, contains approximately 6 million documents. To create a challenging and practically useful video retrieval task, we introduced MultiVENT 2.0, a collection of over 217,000 videos. It includes 2,549 event-centric queries over a test collection of 109,800 videos (MultiVENT Test), capturing a diverse range of current events. Preliminary results show that this task poses significant challenges for current state-of-the-art vision-language models.
The shared task will focus on retrieving relevant visual content related to specific current events. Our goals are to evaluate the effectiveness of existing multimodal models, e.g., language models, for retrieving multilingual, event-based visual content; to explore the contributions of different modalities to this task; and to assess how retrieved content influences downstream generation results. Submitted systems will be evaluated in two ways. First, as a standard ranked information retrieval task, using established metrics from text-based retrieval such as normalized Discounted Cumulative Gain (nDCG). Second, we propose a pilot evaluation of each system's downstream effectiveness for retrieval-augmented generation, using a standard vision-language model (e.g., GPT-4V) along with both automatic metrics and human evaluations. Evaluation leaderboards will be hosted on Eval.ai.
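As a rough illustration of the retrieval-augmented generation pilot, the sketch below passes frames extracted from a system's top-ranked videos to a vision-language model through the OpenAI Python SDK. The model name, prompt, frame-sampling strategy, and file paths are illustrative assumptions, not part of the task definition.

```python
import base64
from pathlib import Path
from openai import OpenAI  # assumes the OpenAI Python SDK is installed and an API key is configured

client = OpenAI()

def encode_frame(path: Path) -> str:
    """Base64-encode one extracted video frame for the vision-language model."""
    return base64.b64encode(path.read_bytes()).decode("utf-8")

def generate_event_summary(query: str, frame_paths: list[Path], model: str = "gpt-4o") -> str:
    """Ask a vision-language model to describe an event given frames from retrieved videos.
    The model name and prompt are placeholders; any GPT-4V-class model could be substituted."""
    content = [{"type": "text",
                "text": f"Using the video frames below, summarize what is known about this event: {query}"}]
    for path in frame_paths:
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{encode_frame(path)}"}})
    response = client.chat.completions.create(model=model,
                                              messages=[{"role": "user", "content": content}])
    return response.choices[0].message.content

# Hypothetical usage: frames previously extracted from the top-ranked videos for one query.
# summary = generate_event_summary("example current-event query",
#                                  [Path("frames/video_042_f01.jpg"), Path("frames/video_017_f03.jpg")])
```

The generated summaries would then be scored with automatic metrics and human judgments as described above.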
The MultiVENT 2.0 dataset is available at https://huggingface.co/datasets/hltcoe/MultiVENT2.0.
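As a starting point, the dataset repository can be fetched with the huggingface_hub client, as in the minimal sketch below; the repository's internal layout (video files, queries, relevance judgments) should be confirmed from the dataset card rather than inferred from this example.

```python
from huggingface_hub import snapshot_download

# Download (or reuse a cached copy of) the MultiVENT 2.0 dataset repository.
# Note: the collection contains over 217,000 videos, so a full download is large;
# the allow_patterns argument can restrict the initial download to metadata files.
local_dir = snapshot_download(repo_id="hltcoe/MultiVENT2.0", repo_type="dataset")
print(f"Dataset files downloaded to: {local_dir}")
```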
Schedule
This will be a one-day hybrid workshop to allow remote participation. The morning session will feature our first invited speaker, followed by selected oral paper presentations. In the afternoon, additional speaker presentations will precede an overview of the shared task and results. The day will conclude with oral presentations of shared task submissions, paper and shared task awards, and a final poster session.