Multi-Instance Generation has advanced significantly in spatial placement and attribute binding. However, existing approaches still face challenges in fine-grained semantic understanding, particularly when dealing with complex textual descriptions. To overcome these limitations, we propose $\textbf{DEIG}$, a novel framework for fine-grained and controllable multi-instance generation. DEIG integrates an $\textit{Instance Detail Extractor}$ (IDE) that transforms text encoder embeddings into compact, instance-aware representations, and a $\textit{Detail Fusion Module}$ (DFM) that applies instance-based masked attention to prevent attribute leakage across instances. These components enable DEIG to generate visually coherent multi-instance scenes that precisely match rich, localized textual descriptions. To support fine-grained supervision, we construct a high-quality dataset with detailed, compositional instance captions generated by VLMs. We also introduce $\textbf{DEIG-Bench}$, a new benchmark with region-level annotations and multi-attribute prompts for both humans and objects. Experiments demonstrate that DEIG consistently outperforms existing approaches across multiple benchmarks in spatial consistency, semantic accuracy, and compositional generalization. Moreover, DEIG functions as a plug-and-play module, making it easily integrable into standard diffusion-based pipelines.
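The core mechanism named above, instance-based masked attention, can be illustrated with a toy sketch. This is not the paper's actual DFM implementation; the function name, shapes, and masking scheme are assumptions chosen for clarity. The idea is that each token carries an instance id, and attention scores between tokens belonging to different instances are suppressed, so one instance's attributes cannot leak into another's representation:

```python
import numpy as np

def instance_masked_attention(queries, keys, values, instance_ids):
    """Toy sketch of instance-based masked attention (illustrative only).

    queries, keys: (T, d) arrays; values: (T, dv) array;
    instance_ids: (T,) integer array assigning each token to an instance.
    Cross-instance attention weights are masked to ~0, so each token's
    output mixes only values from its own instance.
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)                  # (T, T) raw attention logits
    same = instance_ids[:, None] == instance_ids[None, :]   # True within the same instance
    scores = np.where(same, scores, -1e9)                   # block cross-instance attention
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)          # row-wise softmax
    return weights @ values
```

In a real diffusion pipeline the mask would typically be derived from per-instance region layouts (e.g. boxes or segmentation masks) rather than explicit token ids, but the attribute-isolation effect is the same.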
