First CFP: The First Workshop on Multimodal Knowledge and Language Modeling @ IJCAI 2025

Event Notification Type: 
Call for Papers
Abbreviated Title: 
MKLM@IJCAI 2025
Date: 
Saturday, 16 August 2025
State: 
Montreal
Country: 
Canada
Contact Email: 
Contact: 
Liqiang Jing
Submission Deadline: 
Friday, 9 May 2025

Over the past two years, we have witnessed the rise of Large Language Models (LLMs), which have demonstrated unprecedented intelligence and offered hope for achieving Artificial General Intelligence (AGI). Correspondingly, Multimodal LLMs (MLLMs), also known as Large Vision-Language Models (LVLMs), have emerged, extending LLMs beyond language understanding to multiple modalities and bringing us closer to realizing true AGI. Recently, MLLMs have developed rapidly, with a variety of capable models proposed within the community that showcase strong understanding abilities across different modalities. However, a growing number of studies reveal that the capabilities of MLLMs/LVLMs remain limited: they frequently overlook real-world knowledge, leading to hallucinations; struggle to perform commonsense reasoning; and lack domain-specific expert knowledge, rendering them ineffective in vertical domains. Therefore, one potentially underexplored aspect of this burgeoning field is the integration of external knowledge into MLLMs/LVLMs, a crucial component for enhancing reasoning, contextual understanding, and decision-making in real-world applications.

Submission topics: Centered on the core theme of multimodal knowledge and language modeling, we aim to cover the following topics (but not limited to):
- Multimodal knowledge graph construction. Creating knowledge graphs that combine visual and textual information, allowing for richer, more detailed representations of entities, relationships, and event knowledge across modalities.
- Integration of structured and unstructured knowledge into MLLMs/LVLMs. This topic explores methods to incorporate both structured (e.g., knowledge graphs) and unstructured (e.g., image-text pairs) knowledge into vision-language models to improve their reasoning and contextual understanding.
- Knowledge-grounded vision-language reasoning. This focuses on enhancing multimodal reasoning tasks (e.g., complex visual question answering) by grounding MLLMs/LVLMs in external knowledge using retrieval-augmented generation techniques (see the illustrative sketch after this list).
- Commonsense reasoning in multimodal tasks. This topic examines how to integrate commonsense knowledge into MLLMs/LVLMs to enable more intuitive, human-like reasoning across various multimodal applications.
- Benchmarking knowledge-augmented MLLMs/LVLMs on real-world tasks. Developing benchmarks and metrics to evaluate how knowledge-augmented MLLMs/LVLMs perform on practical tasks, including factuality/hallucinations, robustness, accuracy, and generalization across domains.
- Factuality and hallucinations in MLLMs/LVLMs. MLLMs/LVLMs often generate outputs that include factual inaccuracies or hallucinations. This research topic focuses on investigating and mitigating hallucinations in MLLMs/LVLMs. Potential methods include integrating external knowledge sources to ensure that MLLMs/LVLMs can ground their outputs in verifiable facts.
- Explainability, trustworthiness, and safety in knowledge-driven MLLMs/LVLMs. This topic addresses the challenges of making knowledge-enhanced MLLMs/LVLMs more explainable and trustworthy, while also ensuring their safety in sensitive applications.
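
To make the retrieval-augmented grounding topic above more concrete, the following is a minimal, illustrative Python sketch of knowledge-grounded prompting for an MLLM/LVLM. The toy knowledge base, the word-overlap retriever, and the prompt template are assumptions made purely for illustration, not part of any particular model's API; a real system would use a learned multimodal retriever and pass the resulting prompt, together with the image, to an actual MLLM.

# Minimal sketch of knowledge-grounded (retrieval-augmented) prompting for an MLLM.
# Everything here is illustrative: the knowledge base, the word-overlap retriever,
# and the prompt template are stand-ins, not any specific model's API.

from typing import List

# Toy external knowledge base: short textual facts (these could equally be
# knowledge-graph triples rendered as sentences).
KNOWLEDGE_BASE: List[str] = [
    "The Eiffel Tower is located in Paris and was completed in 1889.",
    "Golden retrievers are a dog breed originally developed in Scotland.",
    "The Great Wall of China is more than 13,000 miles long.",
]

def retrieve(question: str, k: int = 2) -> List[str]:
    """Rank knowledge snippets by simple word overlap with the question."""
    q_words = set(question.lower().split())
    scored = sorted(
        KNOWLEDGE_BASE,
        key=lambda fact: len(q_words & set(fact.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_grounded_prompt(question: str, image_path: str) -> str:
    """Prepend retrieved facts so the model can ground its answer in them."""
    facts = "\n".join(f"- {fact}" for fact in retrieve(question))
    return (
        f"External knowledge:\n{facts}\n\n"
        f"Image: {image_path}\n"
        f"Question: {question}\n"
        "Answer using only the image and the facts above."
    )

if __name__ == "__main__":
    # In practice this prompt (plus the image) would be sent to an MLLM/LVLM;
    # here we only print it to show the grounding step.
    print(build_grounded_prompt("When was the tower in this photo completed?", "tower.jpg"))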

Organizing Committee
* Liqiang Jing, University of Texas at Dallas
* Xinya Du, University of Texas at Dallas
* Hao Fei, National University of Singapore
* Jing Gu, University of California, Santa Cruz
* Manling Li, Northwestern University
* Aixin Sun, Nanyang Technological University
* William Wang, University of California, Santa Barbara