Singapore

Despite recent advances in text-to-image (T2I) generation, models still struggle to accurately render prompt-specified text with correct spatial layout—especially in multi-span, structured settings. 
This challenge is driven not only by the lack of datasets that align prompts with the exact text and layout expected in the image, but also by the absence of effective metrics for evaluating layout quality.
To address these issues, we introduce TextGround4M, a large-scale dataset of over 4 million prompt-image pairs, each annotated with span-level text grounded in the prompt and corresponding bounding boxes.
This enables fine-grained supervision for layout-aware, prompt-grounded text rendering.
Building on this, we propose a lightweight training strategy for autoregressive T2I models that appends layout-aware span tokens during training, without altering model architecture or inference behavior.
We further construct a benchmark with stratified layout complexity to evaluate both open-source and proprietary models in a zero-shot setting. 
In addition, we introduce two layout-aware metrics to address the long-standing lack of spatial evaluation in text rendering.
Our results show that models trained on TextGround4M outperform strong baselines in text fidelity, spatial accuracy, and prompt consistency, highlighting the importance of fine-grained layout supervision for grounded T2I generation.

AAAI 2026

TextGround4M: A Prompt-Aligned Dataset for Layout-Aware Text Rendering

text rendering

prompt grounded

text to image generation

poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.<br><br>

To access this event page, you need to log in with the **email address you registered with**. <br>Access credentials will be sent to your email from Underline -  subject line "Welcome to AAAI 2026". Please be sure to check your spam email folder if you do not see an email confirmation right away.

Please log in

To access this event page, you are required to register.
Please complete your registration to continue.

We recommend reading [**the registration information**](https://aaai.org/conference/aaai/aaai-26/registration/) first.

**Online Registration Form**: https://aaai.getregistered.net/conference-2026 

Registration Required

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The emergence of multimodal technologies has propelled Vision-Language Incremental Learning (VLIL) into a research spotlight. Current VLIL approaches predominantly inherit unimodal paradigms, failing to address fundamental distinctions between visual and linguistic modalities. Crucially, the semantic gap between images and text creates divergent learning dynamics: visual data exhibits rich, distributed information while textual representations remain explicit and compact. Consequently, textual elements align with class-specific tasks, whereas individual images inherently span multiple such tasks, creating dual bottlenecks in class-level memory allocation and scene-level knowledge transfer. To overcome these challenges, we propose ​DCIM (Dual Class-Individual Memory)​, a novel framework featuring complementary mechanisms for vision-language continual learning. For class-level constraints, our ​Hierarchical Class Memory Management (HCMM)​​ strategy dynamically allocates memory resources across object categories. It employs forgetting simulation to identify and preserve the most vulnerable samples, ensuring robust long-term knowledge retention. For scene-level adaptation, the ​Scene Reconstruction Memory(SRM)​​ module captures generalized environmental representations, enabling contextual transfer to novel classes and disambiguation of semantically related concepts within shared scenes.Extensive experiments on two vision-language tasks, i.e., visual question answering (VQA) and Image captioning (IC), demonstrate the effectiveness and excellent generalization ability of our approach, achieving superior performance under continual learning settings.

Vision-language Incremental Learning with Dual Class-individual Memory

Three-dimensional atomic arrangements of biomolecules are key to demystifying biological functions. The rapid expansion of accessible structural data, driven by advances in AI for science, highlights the critical challenge of efficiently modeling large-scale biomolecular structures, which are high-dimensional systems shaped by biological assembly principles. To address this, we introduce BiHiTo, a multi-level Biomolecular Hierarchy-inspired Tokenizer that intrinsically mimics natural biological assembly hierarchies. Specifically, we design a multi-codebook quantizer that mirrors the natural hierarchy of biomolecular structure, enabling simultaneous capture of representations spanning atomic motifs to global conformational variations. This hierarchical alignment markedly improves the biological interpretability and reconstruction fidelity of biomolecular structure.Extensive experiments demonstrate that BiHiTo delivers state-of-the-art performance and robust generalization across molecular dynamics trajectories and macromolecular complexes, facilitating advances in structure generation and dynamic conformation exploration. In the reconstruction of the CASP14 and OOD test set FastFolding protein multi-conformation data, our method achieves a 17% and 51% reduction in RMSD compared to Bio2Token, respectively.

BiHiTo: Biomolecular Hierarchy-inspired Tokenization

Large language models (LLMs) are increasingly deployed in real-world communication settings, yet their ability to resolve context-dependent ambiguity remains underexplored. In this work, we present EMODIS, a new benchmark for evaluating LLMs’ capacity to interpret ambiguous emoji expressions under minimal but contrastive textual contexts. Each instance in EMODIS comprises an ambiguous sentence containing an emoji, two distinct disambiguating contexts that lead to divergent interpretations, and a specific question that requires contextual reasoning. We evaluate both open-source and API-based LLMs, and find that even the strongest models frequently fail to distinguish meanings when only subtle contextual cues are present. Further analysis reveals systematic biases toward dominant interpretations and limited sensitivity to pragmatic contrast. EMODIS provides a rigorous testbed for assessing contextual disambiguation, and highlights the gap between human-level and model-level semantic reasoning.

EMODIS: A Benchmark for Context-Dependent Emoji Disambiguation in Large Language Models

Recent advances in video generation techniques have given rise to an emerging paradigm of generative video coding for Ultra-Low Bitrate (ULB) scenarios by leveraging powerful generative priors. However, most existing methods are limited by domain specificity (e.g., facial or human videos) or excessive dependence on high-level text guidance, which tend to inadequately capture fine-grained motion details, leading to unrealistic or incoherent reconstructions. To address these challenges, we propose $\textbf{T}$rajectory-Guided $\textbf{G}$enerative $\textbf{V}$ideo $\textbf{C}$oding (dubbed T-GVC), a novel framework that bridges low-level motion tracking with high-level semantic understanding. T-GVC features a semantic-aware sparse motion sampling pipeline that extracts pixel-wise motion as sparse trajectory points based on their semantic importance, significantly reducing the bitrate while preserving critical temporal semantic information. In addition, by integrating trajectory-aligned loss constraints into diffusion processes, we introduce a training-free guidance mechanism in latent space to ensure physically plausible motion patterns without sacrificing the inherent capabilities of generative models. Experimental results demonstrate that T-GVC outperforms both traditional and neural video codecs under ULB conditions. Furthermore, additional experiments confirm that our framework achieves more precise motion control than existing text-guided methods, paving the way for a novel direction of generative video coding guided by geometric motion modeling. Codes and visual results can be found in the supplementary materials.

T-GVC: Trajectory-Guided Generative Video Coding at Ultra-Low Bitrates

Robust and discriminative feature learning is critical for high-quality point cloud registration. However, existing deep learning–based methods typically rely on Euclidean neighborhood-based strategies for feature extraction, which struggle to effectively capture the implicit semantics and structural consistency in point clouds. To address these issues, we propose a multi-domain context integration network (MCI-Net) that improves feature representation and registration performance by aggregating contextual cues from diverse domains. Specifically, we propose a graph neighborhood aggregation module, which constructs a global graph to capture the overall structural relationships within point clouds. We then propose a progressive context interaction module to enhance feature discriminability by performing intra-domain feature decoupling and inter-domain context interaction. Finally, we design a dynamic inlier selection method that optimizes inlier weights using residual information from multiple iterations of pose estimation, thereby improving the accuracy and robustness of registration. Extensive experiments on indoor RGB-D and outdoor LiDAR datasets show that the proposed MCI-Net significantly outperforms existing state-of-the-art methods, achieving the highest registration recall of 96.4% on 3DMatch. The source code will be released.

MCI-Net: A Robust Multi-Domain Context Integration Network for Point Cloud Registration

Large Language Models (LLMs) show remarkable potential for few-shot information extraction (IE), yet their performance is highly sensitive to the choice of in-context examples. Conventional selection strategies often fail to provide informative guidance, as they overlook a key source of model fallibility: confusion stemming not just from semantic content, but also from the generation of well-structured formats required by IE tasks. To address this, we introduce Active Prompting for Information Extraction (APIE), a novel active prompting framework guided by a principle we term introspective confusion. Our method empowers an LLM to assess its own confusion through a dual-component uncertainty metric that uniquely quantifies both Format Uncertainty (difficulty in generating correct syntax) and Content Uncertainty (inconsistency in extracted semantics). By ranking unlabeled data with this comprehensive score, our framework actively selects the most challenging and informative samples to serve as few-shot exemplars. Extensive experiments on four benchmarks show that our approach consistently outperforms strong baselines, yielding significant improvements in both extraction accuracy and robustness. Our work highlights the critical importance of a fine-grained, dual-level view of model uncertainty when it comes to building effective and reliable structured generation systems.

Reflect Then Learn: Active Prompting for Information Extraction Guided by Introspective Confusion

The rapid advancements in face forgery techniques necessitate that detectors continuously adapt to new forgery methods, thus situating face forgery detection within a continual learning paradigm. However, when detectors learn new forgery types, their performance on previous types often degrades rapidly, a phenomenon known as catastrophic forgetting. Kolmogorov-Arnold Networks (KANs) utilize locally plastic splines as their activation functions, enabling them to learn new tasks by modifying only local regions of the functions while leaving other areas unaffected. Therefore, they are naturally suitable for addressing catastrophic forgetting. However, KANs have two significant limitations: 1) the splines are ineffective for modeling high-dimensional images, while alternative activation functions that are suitable for images lack the essential property of locality; 2) in continual learning, when features from different domains overlap, the mapping of different domains to distinct curve regions always collapses due to repeated modifications of the same regions. In this paper, we propose a **KAN**-based **C**ontinual Face **F**orgery **D**etection (KAN-CFD) framework, which includes a Domain-Group KAN Detector (DG-KD) and a data-free replay Feature Separation strategy via KAN Drift Compensation Projection (FS-KDCP). DG-KD enables KANs to fit high-dimensional image inputs while preserving locality and local plasticity. FS-KDCP avoids the overlap of the KAN input spaces without using data from prior tasks. Experimental results demonstrate that the proposed method achieves superior performance while notably reducing forgetting.

Unifying Locality of KANs and Feature Drift Compensation Projection for Data-Free Replay Based Continual Face Forgery Detection

Maintaining consistent 3D scene representations over time is a significant challenge in computer vision. 
Updating 3D scenes from sparse-view observations is crucial for various real-world applications, including urban planning, disaster assessment, and historical site preservation, where dense scans are often unavailable or impractical. In this paper, we propose Cross-Temporal 3D Gaussian Splatting (Cross-Temporal 3DGS), a novel framework for efficiently reconstructing and updating 3D scenes across different time periods, using sparse images and previously captured scene priors. 
Our approach comprises three stages: 1) Cross-temporal camera alignment for estimating and aligning camera poses across different timestamps; 2) Interference-based confidence initialization to identify unchanged regions between timestamps, thereby guiding updates; and 3) Progressive cross-temporal optimization, which iteratively integrates historical prior information into the 3D scene to enhance reconstruction quality.
Our method supports non-continuous capture, enabling not only updates using new sparse views to refine existing scenes, but also recovering past scenes from limited data with the help of current captures. Furthermore, we demonstrate the potential of this approach to achieve temporal changes using only sparse images, which can later be reconstructed into detailed 3D representations as needed. Experimental results show significant improvements over baseline methods in reconstruction quality and data efficiency, making this approach a promising solution for scene versioning, cross-temporal digital twins, and long-term spatial documentation.

Cross-temporal 3D Gaussian Splatting for Sparse-view Guided Scene Update

Long-term series forecasting leverages historical observations to predict extended future sequences and plays a crucial role across various domains. However, conventional models relying on fixed-length lookback windows struggle with the inherent dynamic dependencies and multi-scale characteristics of time series data. These fixed windows either introduce noise through excessive length or omit critical patterns when too short, while optimal window sizes vary significantly with tasks and external conditions. To address this, we propose the Adaptive Lookback Window (ALW) framework, a wavelet transform-driven approach for multi-scale adaptive lookback window selection. ALW decomposes time series into distinct frequency components through wavelet transforms, quantifies the contribution of each historical time step via scale-specific attention mechanisms, and dynamically determines optimal window lengths through information reverse accumulation and soft truncation techniques. Finally, refined input features are generated for downstream prediction models via a weighted reconstruction process. Extensive evaluations across multiple public benchmarks demonstrate that ALW, as an efficient plug-and-play technique, not only reduces the MSE of backbone models by an average of 3.2% but also reduces hyperparameter tuning requirements and enables input feature dimensionality reduction, which curtails subsequent model computational costs.

Efficiently Enhancing Long-term Series Forecasting via Adaptive Lookback with Wavelets

Image reflection separation aims to disentangle the transmission layer and the reflection layer from a blended image. Existing methods rely on limited information from a single image, tending to confuse the two layers when their contrasts are similar, a challenge more severe at night. To address this issue, we propose the Depth-Memory Decoupling Network (DMDNet). It employs the Depth-Aware Scanning (DAScan) to guide Mamba toward salient structures, promoting information flow along semantic coherence to construct stable states. Working in synergy with DAScan, the Depth-Synergized State-Space Model (DS-SSM) modulates the sensitivity of state activations by depth, suppressing the spread of ambiguous features that interfere with layer disentanglement. Furthermore, we introduce the Memory Expert Compensation Module (MECM), leveraging cross-image historical knowledge to guide experts in providing layer-specific compensation. To address the lack of datasets for nighttime reflection separation, we construct the Nighttime Image Reflection Separation (NightIRS) dataset. Extensive experiments demonstrate that DMDNet outperforms state-of-the-art methods in both daytime and nighttime settings. The code and dataset are available at https://github.com/fashyon/DMDNet.

Content not yet available

Next from AAAI 2026

Vision-language Incremental Learning with Dual Class-individual Memory

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES