Pretrained vision-language models (VLMs), especially CLIP, excel at adapting to downstream tasks through fine-tuning with sufficient high-quality labeled data. However, real-world training data often contain noisy labels, leading to significant performance degradation when models are naively fine-tuned on them. Existing noisy label learning methods for VLMs typically leverage the model's own pretrained knowledge, either via zero-shot predictions or vanilla self-training based on them, to identify and handle noisy samples. Crucially, these approaches blindly trust the VLM's pretrained knowledge, which can introduce endogenous confirmation bias: erroneous pretrained priors lead to incorrect noise detection, further amplifying the bias and corrupting the model. To overcome this limitation, we propose the Debiased Knowledge Adaptation Framework (DKAF), which empowers the model to challenge and correct potentially flawed zero-shot predictions. DKAF operates in three progressive phases: (1) Clean Sample Selection. We introduce a cross-modal collaborative pseudo-labeling strategy to train a robust noisy label detector, explicitly mitigating confirmation bias by aggregating diverse signals beyond the model's initial zero-shot view. (2) Noisy Label Refinement. For samples identified as noisy, we apply a dual-modal consistency strategy to selectively correct their labels, leveraging alignment between visual and textual modalities to guide refinement while minimizing reliance on potentially biased internal knowledge. (3) Model Adaptation. The model is progressively fine-tuned on the jointly curated dataset of selected clean samples and corrected noisy samples, promoting robust adaptation to the target task. Extensive experiments on nine benchmark datasets (with both synthetic and real-world noise) demonstrate that DKAF consistently outperforms state-of-the-art multimodal noisy label learning methods.
Notably, under high-noise conditions, DKAF achieves an average accuracy improvement of 3.28%.
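The three-phase pipeline described in the abstract can be illustrated on toy data. The following is a minimal sketch, not the paper's implementation: the random embeddings, the agreement rules, and the nearest-mean "adaptation" step are all simplified stand-ins (assumptions) for CLIP features, cross-modal collaborative pseudo-labeling, dual-modal consistency, and fine-tuning.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins (illustrative assumptions): "image" and "text" embeddings
# scattered around per-class prototypes, mimicking CLIP features.
n_classes, dim, n = 3, 16, 300
protos = rng.normal(size=(n_classes, dim))
protos /= np.linalg.norm(protos, axis=1, keepdims=True)

true_y = rng.integers(0, n_classes, size=n)
img = protos[true_y] + 0.3 * rng.normal(size=(n, dim))
txt = protos[true_y] + 0.3 * rng.normal(size=(n, dim))

# Inject 40% symmetric label noise into the training labels.
noisy_y = true_y.copy()
flip = rng.random(n) < 0.4
noisy_y[flip] = rng.integers(0, n_classes, size=int(flip.sum()))

# Zero-shot-style views: nearest class prototype under each modality.
pred_img = (img @ protos.T).argmax(axis=1)
pred_txt = (txt @ protos.T).argmax(axis=1)

# Phase 1 (clean sample selection, simplified): keep a sample as clean
# when its given label agrees with both modalities, i.e. aggregate
# signals beyond a single zero-shot view.
clean = (pred_img == noisy_y) & (pred_txt == noisy_y)

# Phase 2 (noisy label refinement, simplified): relabel rejected samples
# only when the two modalities agree with each other, a stand-in for
# the dual-modal consistency check.
refine = ~clean & (pred_img == pred_txt)
refined_y = noisy_y.copy()
refined_y[refine] = pred_img[refine]

# Phase 3 (model adaptation, simplified): fit per-class means on the
# curated set of clean plus corrected samples, then classify with them.
curated = clean | refine
means = np.stack([img[curated & (refined_y == c)].mean(axis=0)
                  for c in range(n_classes)])
pred = (img @ means.T).argmax(axis=1)

acc_noisy = float((noisy_y == true_y).mean())
acc_curated = float((refined_y[curated] == true_y[curated]).mean())
acc_model = float((pred == true_y).mean())
print(f"labels: given={acc_noisy:.2f} curated={acc_curated:.2f} "
      f"model acc={acc_model:.2f}")
```

In this toy setting, the curated labels end up substantially cleaner than the given noisy labels, which is the intuition the framework builds on; the actual method replaces each simplified step with a learned detector, selective correction, and progressive fine-tuning.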