Grounded Multimodal Named Entity Recognition (GMNER) is a recently proposed information extraction task that requires models to extract named entities and ground them to real-world visual objects. Previous methods rely on domain-specific fine-tuning and therefore struggle with unseen multimodal entities, owing to limited knowledge and generalization. Recently, multimodal large language models (MLLMs) have demonstrated strong open-set abilities; however, their performance on GMNER is hindered by a lack of in-domain knowledge, since training them on GMNER datasets is costly. To address these limitations, we propose UnCo, a two-stage Uncertainty-driven Collaborative framework that leverages the complementary strengths of small fine-tuned models and MLLMs. Specifically, in stage one, we equip the small model with unified uncertainty estimation (UE) for multimodal entities. This enables the small model to express "I do not know" when it encounters unseen entities beyond its capabilities. To obtain more robust UE performance, we introduce a modality representation alignment module that mitigates inter-modality bias. Predictions with high uncertainty are then filtered out and delegated to the MLLM. In stage two, an Uncertainty-aware Hierarchical Correction mechanism guides the MLLM to refine the uncertain predictions using its open-domain knowledge. Ultimately, UnCo retains the in-domain knowledge of small models while exploiting the capabilities of MLLMs to handle unseen samples. Extensive experiments on two GMNER benchmarks demonstrate UnCo's effectiveness.
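To make the uncertainty-driven delegation concrete, below is a minimal Python sketch of the two-stage routing, assuming a stage-one model that emits per-entity uncertainty scores and an MLLM-based corrector. All names here (`Prediction`, `unco_pipeline`, the threshold `tau`) are illustrative assumptions, not the paper's actual interface; the UE module, modality representation alignment, and hierarchical correction prompts are omitted.

```python
# Illustrative sketch of UnCo-style routing (hypothetical names; the paper's
# actual models, threshold, and prompts differ).
from dataclasses import dataclass, replace
from typing import Callable, List, Optional, Tuple

@dataclass
class Prediction:
    entity: str                                # recognized entity span
    entity_type: str                           # e.g. "PER", "LOC", "ORG"
    bbox: Optional[Tuple[int, int, int, int]]  # grounded box; None if ungroundable
    uncertainty: float                         # unified uncertainty estimate in [0, 1]

def unco_pipeline(
    preds: List[Prediction],
    mllm_correct: Callable[[Prediction], Prediction],
    tau: float = 0.5,
) -> List[Prediction]:
    """Stage-one predictions whose uncertainty exceeds tau are delegated to
    the MLLM corrector (stage two); confident ones are kept unchanged."""
    return [mllm_correct(p) if p.uncertainty > tau else p for p in preds]

# Toy usage: a stand-in "corrector" that only resets the uncertainty score.
if __name__ == "__main__":
    preds = [
        Prediction("Lionel Messi", "PER", (10, 20, 120, 240), 0.12),
        Prediction("UnCoFest", "MISC", None, 0.83),  # unseen entity, delegated
    ]
    print(unco_pipeline(preds, lambda p: replace(p, uncertainty=0.0)))
```

The design point this sketch captures is that only high-uncertainty predictions incur an MLLM call, so the small model's in-domain knowledge is preserved for confident cases while MLLM capacity is spent on the unseen ones.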