
AAAI 2026

January 25, 2026

Singapore, Singapore


Multimodal large language models (MLLMs) have achieved strong results across a wide range of tasks, but their practical application remains severely constrained by hallucination, which is particularly prominent during reinforcement learning (RL) optimization. This paper systematically analyzes the causes of hallucination in MLLMs under RL training and identifies three key factors: (1) the model relies heavily on chained visual reasoning to guide decision-making during RL training, so errors and irrelevant content in visual reasoning readily induce hallucinations, including inaccurate initial visual descriptions that anchor subsequent inferences to incorrect information, as well as redundant and overly broad inferential content; (2) insufficient exploration diversity during policy optimization, which causes the model to produce overconfident outputs; and (3) destructive conflict between samples during optimization, a key factor leading to spurious associations and unstable parameter updates. To address these issues, we propose a framework comprising three core modules. First, to improve the accuracy of visual grounding, we add planning and caption stages before the thinking and answer stages. To strengthen initial visual descriptions, we have the model answer based solely on the caption and assign a caption reward according to the quality of that answer. Second, to enhance exploration, we classify samples by the mean and variance of their reward distributions and select samples with high reward variance for training, increasing the model's focus on diverse samples. Finally, to mitigate conflicts between training samples, we identify neural tangent kernel (NTK) similarity as the key factor. Rather than minimizing it uniformly, we regulate NTK similarity by grouping sample pairs according to a similarity threshold.
An InfoNCE loss then pulls dissimilar pairs closer and pushes overly similar pairs apart, guiding interactions toward a balanced range. We conducted extensive empirical studies on image, video, and standard hallucination evaluation benchmarks. The results show that the proposed method significantly reduces the hallucination rate and effectively improves the inference accuracy of MLLMs.
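The reward-variance sample selection described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the dictionary input format, and the variance threshold are all assumptions made for the example. The idea is that samples whose rollouts all succeed or all fail carry little gradient signal, while high-variance samples are the informative ones.

```python
import statistics

def select_high_variance_samples(reward_rollouts, var_threshold=0.05):
    """Keep samples whose per-rollout reward variance exceeds a threshold.

    reward_rollouts: dict mapping sample id -> list of rewards obtained
    from multiple sampled responses (rollouts) for that prompt.
    Returns a list of (sample_id, mean_reward, reward_variance) tuples.
    """
    selected = []
    for sid, rewards in reward_rollouts.items():
        mean = statistics.mean(rewards)
        var = statistics.pvariance(rewards)
        # High-variance samples are neither trivially solved (all rewards
        # high) nor hopeless (all rewards low); they are kept for training.
        if var > var_threshold:
            selected.append((sid, mean, var))
    return selected
```

For example, a prompt whose four rollouts score [1, 0, 1, 0] would be selected, while one scoring [1, 1, 1, 1] or [0, 0, 0, 0] would be filtered out.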
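The threshold-based NTK-similarity regulation can also be illustrated with a toy sketch. The abstract does not specify the computation, so this example makes two labeled assumptions: per-sample gradient inner products are used as a stand-in for NTK similarity (a common proxy), and a simple softplus surrogate loss replaces the paper's InfoNCE objective, keeping only the push/pull behavior around the threshold.

```python
import numpy as np

def ntk_similarity(grads):
    """Cosine-normalized gradient inner products as an NTK-similarity proxy.

    grads: (n_samples, n_params) matrix of per-sample gradients.
    """
    g = grads / np.linalg.norm(grads, axis=1, keepdims=True)
    return g @ g.T

def regulation_loss(sim, threshold=0.5, temperature=0.1):
    """Surrogate loss: push overly similar pairs apart, pull dissimilar
    pairs closer, steering pairwise similarity toward a balanced range."""
    n = sim.shape[0]
    loss = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            s = sim[i, j] / temperature
            if sim[i, j] > threshold:
                loss += np.log1p(np.exp(s))    # too similar: push apart
            else:
                loss += np.log1p(np.exp(-s))   # dissimilar: pull together
    return loss / (n * (n - 1) / 2)
```

Minimizing this loss penalizes pairs on the wrong side of the threshold, mirroring the abstract's goal of avoiding both destructive conflict (excessive similarity) and uninformative interaction (excessive dissimilarity).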


