We address payoff-based decentralized learning in infinite-horizon zero-sum Markov games. In this setting, each player makes decisions based solely on received rewards, without observing the opponent's strategy or actions and without sharing information. Prior works established polynomial-time convergence to an approximate Nash equilibrium under strong reachability and mixing-time assumptions. We propose a convergent algorithm that significantly relaxes these assumptions, requiring only the existence of a single policy with bounded reachability and mixing time. Our key algorithmic novelty is the introduction of Tsallis entropy regularization to smooth the best-response policy updates. By suitably tuning this regularization, we ensure sufficient exploration, thereby bypassing the stringent assumptions of prior work on the underlying MDP. We prove polynomial-time convergence to an approximate Nash equilibrium by establishing novel properties of the value and policy updates induced by the Tsallis entropy regularizer.
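For context, a generic sketch of the regularization at work (the abstract does not specify the exact parameterization, so the entropic index $q$ and temperature $\tau$ below are illustrative): the Tsallis entropy of a policy $\pi(\cdot \mid s)$ and the corresponding smoothed best response take the form

\[
H_q\bigl(\pi(\cdot\mid s)\bigr) \;=\; \frac{1}{q-1}\Bigl(1 - \sum_{a}\pi(a\mid s)^{q}\Bigr), \qquad q > 0,\ q \neq 1,
\]
\[
\hat{\pi}(\cdot\mid s) \;\in\; \arg\max_{\pi}\ \sum_{a}\pi(a\mid s)\,Q(s,a) \;+\; \tau\, H_q\bigl(\pi(\cdot\mid s)\bigr), \qquad \tau > 0.
\]

As $q \to 1$, $H_q$ recovers the Shannon entropy, and as $\tau \to 0$ the smoothed update approaches the exact (unregularized) best response; tuning $\tau$ trades off exploration against fidelity to the best response, which is the mechanism the abstract invokes to bypass stringent reachability and mixing-time assumptions.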