Image captioning is crucial for multimodal understanding, bridging visual content and natural language. Despite recent advances in Large Multimodal Models (LMMs), when faced with unseen entities or scenes in the open world, models still produce vague and inaccurate descriptions, and may even generate knowledge hallucinations, even when attempting to leverage learned knowledge. A key reason is that the model fails to effectively integrate knowledge with visual information, limiting its understanding of visual content. We therefore propose Adaptive Knowledge Graph-guided Multimodal Alignment (AKGMA) for image captioning, which enhances semantic understanding in open-world scenes through visual knowledge reasoning, reducing knowledge hallucinations and improving caption quality. It consists of three key components: an Entity-guided Knowledge Aligner (EKA), Adaptive Knowledge Graph Construction (AKGC), and a Scene-Context Knowledge Adapter (SCKA). EKA connects visual entities to knowledge graphs, providing structured knowledge to a small language model that interacts with a visual encoder to acquire visual knowledge. AKGC uses reinforcement learning to build image-relevant subgraphs that optimize knowledge prompts and mitigate knowledge hallucinations. SCKA leverages scene graph annotations to extract visual contextual knowledge and inject it into Large Language Models (LLMs), ensuring the generated descriptions are consistent with the image's details. Additionally, we introduce UniKnowCap, a new image knowledge description dataset spanning various open-world knowledge domains, designed to evaluate the knowledge accuracy and detail consistency of model-generated descriptions. Extensive experiments show that our model outperforms baselines across multiple metrics.