Singapore

Text-to-image generation tasks have driven remarkable advances in diverse media applications, yet most focus on single-turn scenarios and struggle with iterative, multi-turn creative tasks. Recent dialogue-based systems attempt to bridge this gap, but their single-agent, sequential paradigm often causes intention drift and incoherent edits. To address these limitations, we present Talk2Image, a novel multi-agent system for interactive image generation and editing in multi-turn dialogue scenarios. Our approach integrates three key components: intention parsing from dialogue history, task decomposition and collaborative execution across specialized agents, and feedback-driven refinement based on a multi-view evaluation mechanism. Talk2Image enables step-by-step alignment with user intention and consistent image editing. Experiments demonstrate that Talk2Image outperforms existing baselines in controllability, coherence, and user satisfaction across iterative image generation and editing tasks.

AAAI 2026

Talk2Image: A Multi-Agent System for Multi-Turn Image Generation and Editing

nlp: conversational ai/dialog systems

mas: applications

cv: applications

poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.




We introduce the Hybrid Vector-Occupancy Field (HVOF), a new implicit 3D representation for reconstructing both *open and closed* surfaces from sparse point clouds. Existing approaches face severe limitations: occupancy fields and signed distance fields struggle with open surfaces, while unsigned distance fields and neural vector fields exhibit directional instability in complex topologies and ridge regions. HVOF addresses these challenges by incorporating a smoothly decaying occupancy field around the surface, while capturing precise local geometry using truncated displacement vectors, naturally mitigating direction-field ambiguities near ridge regions. This unified design forms a robust hybrid representation that leverages both occupancy and vector fields. To realize it, we design a Hybrid Field variational autoencoder with a hierarchical cross-attention encoder and a dual-branch decoder that jointly learn occupancy and vector fields through continuous weighting. Extensive experiments demonstrate that HVOF consistently outperforms state-of-the-art methods across the ShapeNet, ABC, and MGN datasets, accurately reconstructing both open and closed surfaces while preserving fine geometric details in complex regions.

Hybrid Vector-Occupancy Field for Robust Implicit 3D Surface Reconstruction

Computer-generated holography (CGH) is a promising technology for next-generation displays. However, generating high-speed, high-quality holographic video requires both a high-frame-rate display and efficient computation, and is constrained by two key limitations: (i) Learning-based models often produce over-smoothed phases with narrow angular spectra, causing severe color crosstalk in high-frame-rate full-color display schemes such as depth-division multiplexing and thus forcing a trade-off between frame rate and color fidelity. (ii) Existing frame-by-frame optimization methods typically optimize frames independently, neglecting spatial-temporal correlations between consecutive frames and leading to computationally inefficient solutions. To overcome these challenges, we propose a novel high-speed full-color video CGH generation scheme. First, we introduce Spectrum-Guided Depth Division Multiplexing (SGDDM), which optimizes phase distributions via frequency modulation, enabling high-fidelity full-color display at high frame rates. Second, we present HoloMamba, a lightweight asymmetric Mamba-Unet architecture that explicitly models spatial-temporal correlations across video sequences to enhance reconstruction quality and computational efficiency. Extensive simulated and real-world experiments demonstrate that SGDDM achieves high-fidelity full-color display without compromising frame rate, while HoloMamba generates FHD (1080p) full-color holographic video at over 260 FPS, more than 2.6× faster than the prior state-of-the-art Divide-Conquer-and-Merge Strategy. Code will be released.

High-Speed FHD Full-Color Video Computer-Generated Holography

Large language models (LLMs) have made significant strides in mathematical reasoning, particularly at the elementary level. However, they continue to face substantial challenges when confronted with complex, advanced mathematical problems. In contrast to humans—who can effectively draw upon prior experiences in solving similar problems and retrieve relevant knowledge and theorems from memory—LLMs often struggle to accurately identify analogous problems and to recall or apply appropriate theorems.
To overcome these limitations, we introduce a novel framework for constructing a template-theorem joint knowledge base, leveraging the capabilities of large language models. Inspired by the associative mechanisms of human cognition, our approach abstracts real-world problems into generalized templates and establishes intricate linkages between these templates and pertinent theorems. This design enables the efficient expansion of a comprehensive knowledge base, even when starting from a limited set of seed examples.
Moreover, we develop an efficient retrieval strategy that, given a new problem, systematically extracts and presents the most relevant knowledge from the knowledge base as contextual input to the LLM. Extensive experiments on multiple public mathematical datasets and models demonstrate that our approach consistently surpasses conventional methods. Comprehensive ablation studies further corroborate the effectiveness of both our knowledge base construction and retrieval modules.

Template-Theorems Graph Construction to Enhance Mathematical Reasoning Capabilities of LLM

Large-scale pre-trained diffusion models empower users to edit images through text guidance. However, existing methods often over-align with target prompts while inadequately preserving source image semantics. Such approaches generate target images explicitly or implicitly from the inversion noise of the source images, termed the inversion anchors. We identify this strategy as suboptimal for semantic preservation and inefficient due to elongated editing paths. We propose TweezeEdit, a tuning- and inversion-free framework for consistent and efficient image editing. Our method addresses these limitations by regularizing the entire denoising path rather than relying solely on the inversion anchors, ensuring source semantic retention and shortening editing paths. Guided by gradient-driven regularization, we efficiently inject target prompt semantics along a direct path using a consistency model. Extensive experiments demonstrate TweezeEdit’s superior performance in semantic preservation and target alignment, outperforming existing methods. Remarkably, it requires only 12 steps (1.6 seconds per edit), underscoring its potential for real-time applications.

TweezeEdit: Consistent and Efficient Image Editing with Path Regularization

Multimodal Knowledge Editing (MKE) extends traditional knowledge editing to settings involving both textual and visual modalities. However, existing MKE benchmarks primarily assess final answer correctness, neglecting the quality of intermediate reasoning and robustness to visually rephrased inputs. To address this limitation, we introduce MMQAKE, the first benchmark for multimodal multihop question answering with knowledge editing. MMQAKE evaluates: (1) a model’s ability to reason over 2–5-hop factual chains that span both text and images, including performance at each intermediate step; (2) robustness to visually rephrased inputs in multihop questions.
Our evaluation shows that current MKE methods often struggle to consistently update and reason over multimodal reasoning chains following knowledge edits. 
To overcome these challenges, we propose Hybrid-DMKG, a hybrid reasoning framework built on a dynamic multimodal knowledge graph (DMKG) to enable accurate multihop reasoning over updated multimodal knowledge. Hybrid-DMKG first uses a large language model to decompose multimodal multihop questions into sequential sub-questions, then applies a multimodal retrieval model to locate updated facts by jointly encoding each sub-question with candidate entities and their associated images. For answer inference, a hybrid reasoning module operates over the DMKG via two parallel paths: (1) relation-linking prediction; (2) RAG Reasoning with large vision-language models. A background-reflective decision module then aggregates evidence from both paths to select the most credible answer. Experimental results on MMQAKE show that Hybrid-DMKG significantly outperforms existing MKE approaches, achieving higher accuracy and improved robustness to knowledge updates.

Hybrid-DMKG: A Hybrid Reasoning Framework over Dynamic Multimodal Knowledge Graphs for Multimodal Multihop QA with Knowledge Editing

Graph Neural Networks (GNNs) offer superior modeling capabilities for text classification by capturing complex spatial features within semantic representations. However, existing graph-based approaches often suffer from computational inefficiency and limited ability to model both fine-grained local structures and the sequential nature of text. To address these challenges, we propose HC2-GNN, a Hierarchical Clustering and Coarsening Graph Neural Network, which introduces a novel lightweight graph clustering algorithm called Compromise Conductance Graph Clustering (C2GC). C2GC enables efficient graph clustering while simultaneously preserving both the textual order and the topological coherence of subgraphs. Furthermore, it incorporates a virtue cluster mechanism that expands each subgraph with semantically relevant neighbors, explicitly enabling cross-cluster information propagation without compromising local structural integrity. HC2-GNN aggregates local and global features by combining subgraph-level and full-graph representations, enhancing semantic discriminability for classification. Extensive experiments on benchmark datasets demonstrate that HC2-GNN consistently outperforms existing state-of-the-art text classification methods. Code and data will be released publicly upon publication.

HC2-GNN: Hierarchical Graph Representation Learning for Efficient Text Classification
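The HC2-GNN abstract names its clustering algorithm after conductance but does not define the "compromise" variant. As background only, here is a minimal sketch of the standard graph conductance of a node subset, i.e. the number of cut edges divided by the smaller of the two side volumes. The function name `conductance` and the toy graph are illustrative assumptions and are not taken from the paper.

```python
def conductance(edges, subset):
    """Standard conductance of `subset` in an undirected, unweighted graph.

    edges  : iterable of (u, v) pairs, each undirected edge listed once
    subset : iterable of node ids forming one side of the cut
    Note: this is plain conductance, not the paper's C2GC objective.
    """
    inside = set(subset)
    cut = 0        # edges with exactly one endpoint inside the subset
    vol_in = 0     # sum of degrees of nodes inside the subset
    vol_out = 0    # sum of degrees of nodes outside the subset
    for u, v in edges:
        for node in (u, v):
            if node in inside:
                vol_in += 1
            else:
                vol_out += 1
        if (u in inside) != (v in inside):
            cut += 1
    denom = min(vol_in, vol_out)
    return cut / denom if denom else 0.0


# Toy graph (hypothetical): two triangles joined by a single bridge edge.
toy_edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)]

tight = conductance(toy_edges, {0, 1, 2})   # one triangle: a well-separated cluster
loose = conductance(toy_edges, {0, 1})      # splits a triangle: a poor cluster
print(tight, loose)
```

A lower value indicates a subset that is densely connected internally and weakly connected to the rest of the graph, which is the usual reason conductance is used as a clustering criterion.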

Recovering fine-grained details in extremely low-light images remains challenging due to severe structural information loss and noise corruption. Existing enhancement methods often fail to preserve intricate details and sharp edges, limiting their effectiveness in downstream applications like text and edge detection. To address these deficiencies, we propose an efficient dual-stage approach centered on detail recovery for low-light images. In the first stage, we introduce a Residual Fourier-Guided Module (RFGM) that effectively restores global illumination in the frequency domain. RFGM captures inter-stage and inter-channel dependencies through residual connections, providing robust priors for high-fidelity frequency processing while mitigating error accumulation risks from unreliable priors. The second stage employs complementary Mamba modules specifically designed for textural structure refinement: (1) Patch Mamba operates on channel-concatenated non-downsampled patches, meticulously modeling pixel-level correlations to enhance fine-grained details without resolution loss. (2) Grad Mamba explicitly focuses on high-gradient regions, alleviating state decay in state space models and prioritizing reconstruction of sharp edges and boundaries. Extensive experiments on multiple benchmark datasets and downstream applications demonstrate that our method significantly improves detail recovery performance while maintaining efficiency. Crucially, the proposed modules are lightweight and can be seamlessly integrated into existing Fourier-based frameworks with minimal computational overhead.

Beyond Illumination: Fine-Grained Detail Preservation in Extreme Dark Image Restoration

Snapshot compressive imaging (SCI) captures multispectral images (MSIs) using a single coded two-dimensional (2-D) measurement, but reconstructing high-fidelity MSIs from these compressed inputs remains a fundamentally ill-posed challenge. While diffusion-based reconstruction methods have recently raised the bar for quality, they face critical limitations: a lack of large-scale MSI training data, adverse domain shifts from RGB-pretrained models, and inference inefficiencies due to multi-step sampling. These drawbacks restrict their practicality in real-world applications. In contrast to existing methods, which either follow costly iterative refinement or adapt subspace-based embeddings for diffusion models (e.g., DiffSCI, PSR-SCI), we introduce a fundamentally different paradigm: a self-supervised One-Step Diffusion (OSD) framework designed specifically for SCI. The key novelty lies in using a single-step diffusion refiner to correct an initial reconstruction, eliminating iterative denoising entirely while preserving generative quality. Moreover, we adopt a self-supervised equivariant learning strategy to train both the predictor and refiner directly from raw 2-D measurements, enabling generalization to unseen domains without ground-truth MSI. To further address limited MSI data, we design a band-selection–driven distillation strategy that transfers core generative priors from large-scale RGB datasets, effectively bridging the domain gap. Extensive experiments confirm that our approach sets a new standard, yielding PSNR gains of 3.44 dB, 1.61 dB, and 0.33 dB on the Harvard, NTIRE, and ICVL datasets respectively, while cutting reconstruction time by 97.5%, from 8.9 s to just 0.22 s per image. This leap in efficiency and adaptability makes our method a major advancement in SCI reconstruction, both accurate and practical for real-world deployment.

Self-Supervised One-Step Diffusion Refinement for Snapshot Compressive Imaging
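The timing claim in the abstract above can be checked with simple arithmetic. The short sketch below uses only the two per-image times quoted in the abstract (8.9 s and 0.22 s); the helper name `relative_reduction` is an illustrative assumption.

```python
def relative_reduction(before, after):
    """Fraction of the original time that was removed."""
    return (before - after) / before


before_s = 8.9    # per-image reconstruction time quoted for the prior approach
after_s = 0.22    # per-image reconstruction time quoted for the proposed method

reduction = relative_reduction(before_s, after_s)
speedup = before_s / after_s

print(f"reduction: {reduction:.1%}")   # reduction: 97.5%
print(f"speedup:   {speedup:.1f}x")    # speedup:   40.5x
```

The quoted 97.5% reduction is therefore consistent with the two timings, and corresponds to roughly a 40-fold speedup.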

Existing community search methods rely heavily on labeled data or predefined structures and thus fail to capture obscure and dynamic community boundaries in open-world heterogeneous networks, leading to poor adaptability. They also neglect behavioral patterns, resulting in poor search performance. To address these issues, this work formally defines the unsupervised behavior-driven community search problem for heterogeneous graphs and designs a dual-view Contrastive Learning-based Unsupervised framework for Heterogeneous graph Community Search (CLUHCS). CLUHCS combines two perspectives: a relation view that encodes local community cohesion, and a meta-path view that captures global behavior semantics. A PathSim averaging strategy generates positive samples and self-supervised signals, completely eliminating label dependency. Contrastive training is then used to automatically learn community representations and resolve the ambiguity of open community boundaries. Furthermore, by capturing behavior patterns, the meta-path behavior modeling flexibly characterizes the formation mechanism of heterogeneous communities. Experiments on three datasets verify the effectiveness and efficiency of CLUHCS, which improves F1-score by 52.7% over the unsupervised baseline FCS-HGNN and by 41.5% over the supervised method TransZero.

CLUHCS: Dual-View Contrastive Learning Enabled Unsupervised Heterogeneous Community Search with Meta-Path Behavior Modeling
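The CLUHCS abstract refers to PathSim without defining it. As background, the sketch below computes the standard PathSim similarity for a symmetric meta-path: twice the number of path instances between two nodes, divided by the sum of each node's path instances back to itself. The abstract does not specify how CLUHCS averages these scores, so that step is not shown. The author-paper matrix and the function names are illustrative assumptions.

```python
def commuting_matrix(adj):
    """Return M = adj * adj^T for a bipartite adjacency matrix.

    For an author-by-paper matrix, M[i][j] counts instances of the
    symmetric meta-path author -> paper -> author between authors i and j.
    """
    n = len(adj)
    return [
        [sum(a * b for a, b in zip(adj[i], adj[j])) for j in range(n)]
        for i in range(n)
    ]


def pathsim(m, x, y):
    """Standard PathSim score between nodes x and y given commuting matrix m."""
    denom = m[x][x] + m[y][y]
    return 2 * m[x][y] / denom if denom else 0.0


# Hypothetical toy data: 3 authors (rows) by 4 papers (columns).
author_paper = [
    [1, 1, 0, 0],   # author 0 wrote papers 0 and 1
    [1, 1, 1, 0],   # author 1 wrote papers 0, 1 and 2
    [0, 0, 0, 1],   # author 2 wrote paper 3 only
]

M = commuting_matrix(author_paper)
print(pathsim(M, 0, 1))   # authors sharing two papers
print(pathsim(M, 0, 2))   # authors sharing no papers
```

Scores lie in [0, 1], and a node always has similarity 1 with itself when it has at least one path instance, which makes high-scoring pairs natural candidates for positive samples in a contrastive setup.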

Large Reasoning Models (LRMs) have recently demonstrated impressive performance across a range of reasoning tasks by generating intermediate thoughts. However, these models can suffer from overthinking—generating excessive tokens that contribute little to final accuracy while increasing inference cost. To mitigate this, we propose TIV (Thought Injection via Vectors), an innovative framework that compresses token-level reasoning into compact vectors without sacrificing performance. Rather than generating explicit thoughts, TIV injects learnable vectors into the post-attention hidden states of the final token across Transformer layers, enabling implicit and lightweight reasoning. We further introduce a two-stage reinforcement learning strategy: the first stage calibrates the model's reasoning distribution, and the second distills it into a vector-based policy optimized for both accuracy and brevity. Experiments on three reasoning benchmarks show that TIV preserves over 99% of the original accuracy while reducing output length by more than 65% on average, reaching up to 80% in some cases. Moreover, TIV consistently achieves superior trade-offs between accuracy and efficiency compared to existing methods, distinguishing itself as a state-of-the-art (SOTA) approach for efficient reasoning in LRMs.
