Singapore

Multimodal Large Language Models (MLLMs) play an increasingly important role in multimodal intelligence. However, existing fine-tuning methods often ignore cross-modal heterogeneity, which limits their potential. In this work, we propose a novel fine-tuning strategy that injects beneficial random noise; it outperforms previous methods and even surpasses full fine-tuning with minimal additional parameters. The proposed Multimodal Noise Generator (MuNG) enables efficient modality fine-tuning by injecting customized noise into frozen MLLMs. Specifically, we reformulate the reasoning process of MLLMs from a variational inference perspective, upon which we design a multimodal noise generator that dynamically analyzes cross-modal relationships in image-text pairs to generate task-adaptive beneficial noise. Injecting this type of noise into the MLLMs effectively suppresses irrelevant semantic components, leading to significantly improved cross-modal representation alignment and enhanced performance on downstream tasks. Experiments on two mainstream MLLMs, LLaVA and QwenVL, demonstrate that our method surpasses full-parameter fine-tuning and other existing fine-tuning approaches while adjusting only about $1\sim2\%$ additional parameters. The code is provided in the supplementary material.

AAAI 2026

Explore How to Inject Beneficial Noise in MLLMs

large multimodal models (lmms)

multimodal learning

machine learning

poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.




Knowledge distillation (KD) is a promising compression technique for reducing the computational burden of large language models (LLMs). Depending on access to the teacher model’s internal parameters, KD is typically categorized into white-box and black-box KD. While white-box KD benefits from full access to intrinsic knowledge such as softmax distributions, black-box KD adopts a black-box LLM (e.g., GPT-4) as the teacher, which provides only text-level outputs via API calls. This limited supervision makes black-box KD generally less effective than its white-box counterpart. To bridge the gap between white-box and black-box KD, we propose GrayKD, a novel framework that can effectively distill text-level knowledge from a black-box LLM in a single-stage manner. In particular, rationales generated by the black-box LLM are injected into the student via a lightweight cross-attention module (teacher mode), enabling the model to approximate the black-box teacher’s output distribution without access to internal parameters. The student is then trained with the softmax-level knowledge provided by the teacher mode (student mode). Since both the teacher and student modes share the same backbone, the proposed teacher mode remains highly parameter-efficient, requiring only a small number of additional parameters for rationale injection. Experimental results on instruction-following tasks demonstrate that GrayKD achieves substantial performance improvements over existing KD methods.

GrayKD: Distilling Better Knowledge from Black-box LLM via Multi-rationale Injection

Several studies have demonstrated that large language models (LLMs) exhibit positional bias when answering multiple-choice questions (MCQs). Previous methods have treated this bias as detrimental and developed techniques to mitigate it. However, we observe that certain permutations of options can actually improve performance. Therefore, instead of eliminating such bias, we propose an EMbracing the Bias EquivaRiantly (EMBER) network. Specifically, the EMBER network, which outputs a permutation of the options in an MCQ, is optimized towards the beneficial permutations to which the LLM is biased. Additionally, to handle the positional bias across different permutations of options, the EMBER network is designed to make the LLM equivariant to option permutations. Theoretically and empirically, we show that the proposed EMBER network can effectively utilize the positional bias and achieves state-of-the-art performance over various baselines.

Embracing Positional Bias in Multiple-Choice Question Answering via Permutation Equivariant Neural Networks

Vision-Language Models (VLMs) have advanced multimodal understanding, yet they remain susceptible to adversarial attacks. Among various strategies, transfer-based attacks are notably effective, especially in black-box scenarios. The dominant approach within this paradigm leverages generative models to create image targets from text, consistently outperforming text-only methods. However, this approach suffers from a fundamental limitation: generative models introduce visual features irrelevant or even detrimental to textual semantics, misguiding optimization. 
To investigate this limitation, we conduct a comprehensive analysis that reveals two critical findings. First, optimal attack directions lie in synergistic spaces between image and text gradients, demonstrating that text provides complementary information. Second, widespread gradient conflicts occur when combining modalities, where image-target gradients oppose text-target directions. This conflict provides direct evidence that extraneous visual information actively harms optimization, driving it away from the intended textual objectives.
Based on these insights, we propose Text-Guided Gradient Refinement (TGGR), a novel framework that employs a conflict-aware projection mechanism to resolve this conflict. TGGR preserves the beneficial characteristics of image targets by decomposing the image gradient and surgically removing components that oppose the textual guidance. Extensive experiments on models such as LLaVA and GPT-4o demonstrate that TGGR substantially improves attack success rates. Specifically, on GPT-4o, TGGR yields an improvement of up to 14\% over state-of-the-art methods, achieving 96\% attack success rate.
Our work offers a principled framework for developing more synergistic and effective adversarial strategies against VLMs.

Text-Guided Gradient Refinement: Resolving Multimodal Gradient Conflicts to Boost Adversarial Attacks on Vision-Language Models
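The conflict-aware projection described in the TGGR abstract can be illustrated with a small sketch. It assumes the simplest possible form of the idea: when the image-target gradient has a negative inner product with the text-target gradient, the component along the text gradient is subtracted, and otherwise the image gradient is left unchanged. The function names and the `eps` stabilizer are hypothetical, and the paper's actual decomposition may differ.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def refine_image_gradient(g_img, g_txt, eps=1e-12):
    """Drop the part of g_img that opposes g_txt (assumed projection rule).

    If the two gradients do not conflict (inner product >= 0), g_img is
    returned unchanged. Otherwise its projection onto g_txt is removed,
    leaving a vector that is (numerically) orthogonal to g_txt.
    """
    inner = dot(g_img, g_txt)
    if inner >= 0.0:
        return list(g_img)
    scale = inner / (dot(g_txt, g_txt) + eps)  # eps guards against a zero text gradient
    return [gi - scale * gt for gi, gt in zip(g_img, g_txt)]


# Toy 2-D example: the image gradient points partly against the text gradient.
g_txt = [1.0, 0.0]
g_img = [-1.0, 2.0]
refined = refine_image_gradient(g_img, g_txt)
print(refined)  # the first coordinate is driven to (almost) zero, the second is kept
```

Only the conflicting case is modified, so a gradient that already agrees with the text guidance passes through untouched.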

Open-set semi-supervised learning (OSSL) leverages unlabeled data containing both in-distribution (ID) and unknown out-of-distribution (OOD) samples, aiming simultaneously to improve closed-set accuracy and detect novel OOD instances. Existing methods either discard valuable information from uncertain samples or force-align every unlabeled sample into one or a few synthetic “catch-all” representations, resulting in geometric collapse and overconfidence on only seen OODs. To address these limitations, we introduce selective non-alignment, adding a novel “skip” operator to the conventional pull and push operations of contrastive learning. Our framework, SkipAlign, selectively skips alignment (pulling) for low-confidence unlabeled samples, retaining only gentle repulsion against ID prototypes. This approach transforms uncertain samples into a pure repulsion signal, resulting in tighter ID clusters and naturally dispersed OOD features. Extensive experiments demonstrate that SkipAlign significantly outperforms state-of-the-art methods in detecting unseen OOD data without sacrificing ID classification accuracy.

Let the Void Be Void: Robust Open-Set Semi-Supervised Learning via Selective Non-Alignment
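The "skip" operator from the SkipAlign abstract can be sketched as a per-sample loss. Everything concrete here is an assumption made for illustration: the confidence threshold `tau`, the cosine-similarity form of the pull and push terms, and the `repel_weight` that keeps the repulsion "gentle". The paper's actual objective is not given in the abstract and may differ.

```python
import math


def cosine(u, v, eps=1e-12):
    """Cosine similarity between two vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / (den + eps)


def skip_align_loss(feature, prototypes, confidence, pseudo_label,
                    tau=0.9, repel_weight=0.1):
    """Hypothetical per-sample loss with a skip branch for uncertain samples."""
    sims = [cosine(feature, p) for p in prototypes]
    if confidence >= tau:
        # Confident sample: pull towards its own prototype, push from the others.
        pull = 1.0 - sims[pseudo_label]
        push = sum(max(0.0, s) for k, s in enumerate(sims) if k != pseudo_label)
        return pull + push
    # Uncertain sample: the pull term is skipped entirely; only a weak
    # repulsion from every ID prototype remains.
    return repel_weight * sum(max(0.0, s) for s in sims) / len(sims)


prototypes = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
uncertain = [0.0, 0.0, 1.0]  # orthogonal to both ID prototypes
print(skip_align_loss(uncertain, prototypes, confidence=0.3, pseudo_label=0))
```

In this toy setup an uncertain feature that already sits away from every ID prototype incurs no loss, while one that overlaps a prototype is only weakly pushed away and never pulled.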

Partially observable Markov decision processes (POMDPs) present significant challenges for reinforcement learning, as agents must learn optimal policies while maintaining belief states over unobserved environment states based on partial observations. We observe a compelling analogy: large language models (LLMs) autoregressively generate token probability distributions based on preceding context, mirroring how belief states are maintained and updated in POMDPs. This insight motivates leveraging the rich prior knowledge embedded in pre-trained LLMs for latent state estimation from observation-action histories. However, two critical challenges emerge: on the one hand, modality misalignment prevents LLMs from directly encoding visual observations and discrete actions; on the other hand, semantic misalignment exists between observation-action sequences and token sequences. To address these challenges, we introduce a novel framework, ELSLLM, that employs a Johnson-Lindenstrauss projection (JLP) module to transform input dimensions while preserving state similarity with theoretical guarantees, and utilizes modern Hopfield networks (MHN) to store all word embeddings from pre-trained LLMs as a knowledge repository. Through retrieval and querying mechanisms, ELSLLM achieves token-level knowledge alignment without requiring fine-tuning of the pre-trained LLMs. Extensive experiments on partially observable environments demonstrate that ELSLLM achieves state-of-the-art performance, significantly outperforming baseline methods with and without LSTM memory mechanisms. Our work opens new avenues for integrating pre-trained LLMs with reinforcement learning in partially observable settings.

From Tokens to Latent States: Leveraging Pre-trained Language Models for Improving Partially Observable Reinforcement Learning
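The two building blocks named in the ELSLLM abstract can be sketched generically. This is not the authors' module: it assumes a dense Gaussian random matrix as the Johnson-Lindenstrauss projection and the standard one-step modern Hopfield update (a softmax-weighted average of stored patterns). The dimensions, `beta`, and the toy vectors are illustrative choices only.

```python
import math
import random


def jl_matrix(d_in, d_out, rng):
    """Dense Gaussian projection with entries drawn from N(0, 1/d_out)."""
    scale = 1.0 / math.sqrt(d_out)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(d_in)] for _ in range(d_out)]


def project(matrix, x):
    """Map x from d_in to d_out dimensions."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in matrix]


def sq_dist(u, v):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def hopfield_retrieve(query, stored, beta=8.0):
    """One modern-Hopfield update: softmax(beta * <query, pattern>) over stored patterns."""
    scores = [beta * sum(q * s for q, s in zip(query, pat)) for pat in stored]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(stored[0])
    return [sum(w * pat[j] for w, pat in zip(weights, stored)) / total for j in range(dim)]


rng = random.Random(0)
d_in, d_out = 300, 256
P = jl_matrix(d_in, d_out, rng)
a = [rng.gauss(0.0, 1.0) for _ in range(d_in)]
b = [rng.gauss(0.0, 1.0) for _ in range(d_in)]
ratio = sq_dist(project(P, a), project(P, b)) / sq_dist(a, b)
print(f"squared-distance ratio after projection: {ratio:.3f}")  # expected to be near 1

# Toy "word embeddings" stored as Hopfield patterns; the query is a noisy copy of the second.
stored = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(hopfield_retrieve([0.1, 0.9, 0.0], stored))
```

The projection approximately preserves pairwise distances, which is the property the abstract relies on for state similarity. The retrieval step returns a convex combination of stored patterns that, for a sufficiently large `beta`, concentrates on the pattern closest to the query.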

Regulatory compliance checking for online medical advertisements poses a critical public safety challenge distinct from traditional fact-checking, particularly in low-resource languages. Existing automated systems are ill-suited for the authorization-based, evidence-grounded, and explainable reasoning this task demands. To address this gap, we introduce \texttt{VietCheckMed}, a novel retrieval-augmented framework, and \texttt{VietAestheticAds}, the first large-scale, expert-validated benchmark for this task, comprising \textbf{8,329 advertisements} paired with an authoritative regulatory corpus of \textbf{9,978 facilities}. Comprehensive experiments demonstrate that our evidence-grounded approach is essential, outperforming powerful unassisted LLM baselines by over 0.3805 in F1-score. A detailed analysis reveals that the primary remaining challenges are nuanced failures in semantic and logical reasoning, defining a clear frontier for future research. To promote advances in regulatory technology and responsible AI, our dataset, code, and evaluation scripts will be made publicly available. This work contributes a foundational methodology and a vital public resource for developing responsible AI in high-stakes regulatory domains.

VietCheckMed: Explainable Regulatory Compliance Checking for Medical Advertisements on Vietnamese Social Media

Large vision-language models (LVLMs) have shown impressive performance across a broad range of multimodal tasks. However, robust image caption evaluation using LVLMs remains challenging, particularly under domain-shift scenarios. To address this issue, we introduce the Distribution-Aware Score Decoder (DISCODE), a novel finetuning-free method that generates robust evaluation scores better aligned with human judgments across diverse domains. The core idea behind DISCODE lies in its test-time adaptive evaluation approach, which introduces the Adaptive Test-Time (ATT) loss, leveraging a Gaussian prior distribution to improve robustness in evaluation score estimation. This loss is efficiently minimized at test time using an analytical solution that we derive. Furthermore, we introduce the Multi-domain Caption Evaluation (MCEval) benchmark, a new image captioning evaluation benchmark covering six distinct domains, designed to assess the robustness of evaluation metrics. In our experiments, we demonstrate that DISCODE achieves state-of-the-art performance as a reference-free evaluation metric across MCEval and four representative existing benchmarks. Our code and dataset will be made publicly available.

DISCODE: Distribution-Aware Score Decoder for Robust Automatic Evaluation of Image Captioning

In the field of audio generation, signal-to-noise ratio (SNR) has long served as an objective metric for evaluating audio quality. Nevertheless, recent studies have shown that SNR and its variants are not always highly correlated with human perception, prompting us to raise two questions: \textit{Why does SNR fail in measuring audio quality?} And \textit{how can its reliability as an objective metric be improved?} In this paper, we identify the inadequate measurement of phase distance as a pivotal factor and propose to reformulate SNR with specially designed phase-distance terms, yielding an improved metric named GOMPSNR. We further extend the newly proposed formulation to derive two novel categories of loss function, corresponding to magnitude-guided phase refinement and joint magnitude-phase optimization, respectively. In addition, we conduct extensive experiments to find an optimal combination of different loss functions. Experimental results on advanced neural vocoders demonstrate that our proposed GOMPSNR exhibits more reliable error measurement than SNR. Meanwhile, our proposed loss functions yield substantial improvements in model performance, and our well-chosen combination of different loss functions further optimizes the overall model capability.

GOMPSNR: Reflourish the Signal-to-Noise Ratio Metric for Audio Generation Tasks
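For context, the baseline metric discussed in the GOMPSNR abstract is the conventional waveform-domain SNR, $10\log_{10}\big(\lVert s\rVert^2 / \lVert s-\hat{s}\rVert^2\big)$. The sketch below computes only this standard quantity, not GOMPSNR itself, whose phase-distance terms are not specified in the abstract. The test signal, the sample count, and the phase offset are illustrative assumptions.

```python
import math


def snr_db(reference, estimate):
    """Conventional SNR in dB between a reference waveform and an estimate."""
    signal_power = sum(r * r for r in reference)
    noise_power = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if noise_power == 0.0:
        return math.inf  # identical waveforms
    return 10.0 * math.log10(signal_power / noise_power)


n = 1000
cycles = 5  # an integer number of periods over the window
reference = [math.sin(2.0 * math.pi * cycles * k / n) for k in range(n)]

# Same sinusoid with a 90-degree phase offset: identical magnitude spectrum,
# but the sample-wise error is large, so the waveform SNR collapses.
shifted = [math.sin(2.0 * math.pi * cycles * k / n + math.pi / 2.0) for k in range(n)]

# Same phase, 10% amplitude error.
scaled = [0.9 * r for r in reference]

print(f"phase-shifted copy: {snr_db(reference, shifted):.2f} dB")
print(f"amplitude-scaled copy: {snr_db(reference, scaled):.2f} dB")
```

The comparison shows how strongly the plain waveform SNR reacts to a pure phase offset relative to a small amplitude error, which gives some intuition for why the treatment of phase is a natural place to revisit the metric.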

Text-guided image inpainting endeavors to generate new content within specified regions of images using textual prompts from users. The primary challenge is to accurately align the inpainted areas with the user-provided prompts while maintaining a high degree of visual fidelity. While existing inpainting methods have produced visually convincing results by leveraging the pre-trained text-to-image diffusion models, they still struggle to uphold both prompt alignment and visual rationality simultaneously. In this work, we introduce FreeInpaint, a plug-and-play tuning-free approach that directly optimizes the diffusion latents on the fly during inference to improve the faithfulness of the generated images. Technically, we introduce a prior-guided noise optimization method that steers model attention towards valid inpainting regions by optimizing the initial noise. Furthermore, we meticulously design a composite guidance objective tailored specifically for the inpainting task. This objective efficiently directs the denoising process, enhancing prompt alignment and visual rationality by optimizing intermediate latents at each step. Through extensive experiments involving various inpainting diffusion models and evaluation metrics, we demonstrate the effectiveness and robustness of our proposed FreeInpaint.

FreeInpaint: Tuning-free Prompt Alignment and Visual Rationality Enhancement in Image Inpainting

3D Gaussian Splatting (3DGS) has revolutionized 3D scene representation in both efficiency and quality. Recent adaptations of Gaussian splatting specifically tailored for computed tomography (CT) have shown promising results but still suffer from severe artifacts under highly sparse-view X-ray conditions and lack robustness in dynamic scenarios. To tackle these challenges, we propose Tomographic Geometry Field (TG-Field), a geometry-aware Gaussian deformation framework tailored specifically for sparse-view and dynamic CT reconstruction. A hash encoder is introduced to explicitly capture spatial geometric relationships among Gaussian primitives, significantly regularizing their spatial distribution under ultra-sparse conditions. We further extend this framework to dynamic reconstruction by introducing time-conditioned representations. To alleviate hash collisions and temporal inconsistencies caused by joint spatiotemporal encoding, a spatiotemporal attention module is proposed, which adaptively recalibrates and optimizes Gaussian features across frames. Moreover, we incorporate a motion-flow network to model fine-grained respiratory motion, enabling accurate tracking of local anatomical deformations. Extensive experiments on synthetic and real-world datasets demonstrate that TG-Field consistently outperforms existing methods, achieving state-of-the-art reconstruction accuracy under highly sparse-view conditions. We will release the source code.


