The proliferation of multi-modal data on the internet has intensified the need for structured event understanding across text and visual modalities. However, existing cross-modal event extraction models suffer from three major limitations: the absence of explicit event schema guidance, coarse-grained multi-modal alignment strategies, and reliance on heterogeneous, misaligned multi-modal training datasets. To address these issues, we propose a Multi-modal Schema-Guided Progressive Instruction Tuning framework (LLaVA-MS-PIT) that explicitly injects structured multi-modal event schema knowledge into the model before event extraction. Specifically, we introduce a textual event schema to establish the model’s prior knowledge of event information and strengthen its event structure reasoning, while a visual event schema bridges the representational gap between the textual and visual modalities at the event level, enabling unified and semantically aligned event representations across modalities. Furthermore, to alleviate the data scarcity and modality misalignment inherent in current benchmarks, we construct imSitu-MME, a high-quality multi-modal parallel dataset built via schema-guided data generation and annotation. Extensive experiments demonstrate that LLaVA-MS-PIT achieves competitive performance on multi-modal event extraction benchmarks, indicating the effectiveness and necessity of schema-guided progressive instruction tuning.