Singapore

Multimedia documents such as slide presentations and posters are designed to be interactive and easy to modify. Yet, they are often distributed in a static raster format, which limits editing and customization. Restoring their editability requires converting these raster images back into structured vector formats. However, existing geometric raster vectorization methods, which rely on low-level primitives like curves and polygons, fall short at this task. Specifically, when applied to complex documents like slides, they fail to preserve the high-level structure, resulting in a flat collection of shapes where the semantic distinction between image and text elements is lost. To overcome this limitation, we address the problem of semantic document derendering by introducing SliDer, a novel framework that uses Vision-Language Models (VLMs) to derender slide images as compact and editable Scalable Vector Graphics (SVG) representations. SliDer detects individual image and text elements in a raster input, extracts their attributes, and organizes them into a coherent SVG format. Crucially, the model iteratively refines its predictions during inference in a process analogous to human design, generating SVG code that more faithfully reconstructs the original raster upon rendering. Furthermore, we introduce Slide2SVG, a novel dataset comprising raster-SVG pairs of slide documents curated from real-world scientific presentations, to facilitate future research in this domain. Our results demonstrate that SliDer achieves a reconstruction LPIPS of 0.069, and is favored by human evaluators in 82.9% of cases compared to the strongest zero-shot VLM baseline. Our code and dataset will be publicly released at a later stage.

AAAI 2026

Semantic Document Derendering: SVG Reconstruction via Vision-Language Modeling

cv: large vision models

cv: language and vision

cv: applications

poster
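To make the target representation of the SliDer abstract concrete, the following minimal sketch assembles an SVG document in which image and text elements stay separate, editable nodes. It is only an illustration: the element dictionary schema, the function name `build_slide_svg`, and the file name `figure1.png` are hypothetical, and nothing here reflects SliDer's actual output format or model.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def build_slide_svg(width, height, elements):
    """Serialize a list of element dicts (hypothetical schema) into SVG text."""
    svg = ET.Element("svg", {
        "xmlns": SVG_NS,
        "width": str(width),
        "height": str(height),
        "viewBox": f"0 0 {width} {height}",
    })
    for el in elements:
        if el["type"] == "image":
            # Image elements keep their bounding box and a reference to the crop.
            ET.SubElement(svg, "image", {
                "x": str(el["x"]), "y": str(el["y"]),
                "width": str(el["w"]), "height": str(el["h"]),
                "href": el["href"],
            })
        elif el["type"] == "text":
            # Text elements stay as editable text rather than outlined shapes.
            node = ET.SubElement(svg, "text", {
                "x": str(el["x"]), "y": str(el["y"]),
                "font-size": str(el["size"]), "fill": el["fill"],
            })
            node.text = el["content"]
        else:
            raise ValueError(f"unknown element type: {el['type']}")
    return ET.tostring(svg, encoding="unicode")


# Assumed example input: one figure crop and one title string.
slide_elements = [
    {"type": "image", "x": 40, "y": 120, "w": 400, "h": 300, "href": "figure1.png"},
    {"type": "text", "x": 40, "y": 80, "size": 36, "fill": "#222222",
     "content": "Semantic Document Derendering"},
]
svg_text = build_slide_svg(960, 540, slide_elements)
print(svg_text)
```

Because each element is its own node, changing the title or swapping the figure is a local edit, which is the kind of editability a flat collection of curves and polygons does not offer.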

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.


We recommend reading [**the registration information**](https://aaai.org/conference/aaai/aaai-26/registration/) first.

**Online Registration Form**: https://aaai.getregistered.net/conference-2026 



Diffusion models (DMs) have proven to be powerful priors for signal recovery, but their application to 1-bit quantization tasks, such as 1-bit compressed sensing and logistic regression, remains a challenge. This difficulty stems from the inherent non-linear link function in these tasks, which is either non-differentiable or lacks an explicit characterization. To tackle this issue, we introduce Diff-OneBit, a fast and effective DM-based approach for signal recovery under 1-bit quantization. Diff-OneBit addresses the challenge posed by non-differentiable or implicit link functions by leveraging a differentiable surrogate likelihood function to model 1-bit quantization, thereby enabling gradient-based iterations. This function is integrated into a flexible plug-and-play framework that decouples the data-fidelity term from the diffusion prior, allowing any pretrained DM to act as a denoiser within the iterative reconstruction process. Extensive experiments on the FFHQ and CelebA datasets demonstrate that Diff-OneBit produces high-fidelity reconstructed images, outperforming state-of-the-art methods in both reconstruction quality and computational efficiency across 1-bit compressed sensing and logistic regression tasks.

Diffusion Model Based Signal Recovery Under 1-Bit Quantization
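The idea of replacing a non-differentiable sign measurement with a differentiable surrogate likelihood can be illustrated with a small sketch. The code below assumes a logistic surrogate with a smoothing parameter `tau` and plain gradient descent on the data-fidelity term only; these choices, the function names, and the toy problem sizes are assumptions for illustration, and the diffusion denoiser that Diff-OneBit plugs in as a prior is omitted entirely.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def softplus(z):
    # Stable log(1 + exp(z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def dot(row, x):
    return sum(a * v for a, v in zip(row, x))


def surrogate_loss(A, y, x, tau=0.5):
    # Negative log-likelihood of the signs y under a logistic link (assumed surrogate).
    return sum(softplus(-yi * dot(row, x) / tau) for row, yi in zip(A, y)) / len(y)


def surrogate_grad(A, y, x, tau=0.5):
    grad = [0.0] * len(x)
    for row, yi in zip(A, y):
        margin = yi * dot(row, x) / tau
        weight = -yi * sigmoid(-margin) / tau  # derivative of softplus(-margin)
        for j, a in enumerate(row):
            grad[j] += weight * a
    return [g / len(y) for g in grad]


random.seed(0)
n, m = 8, 64  # toy signal length and number of 1-bit measurements
x_true = [random.gauss(0.0, 1.0) for _ in range(n)]
A = [[random.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m)]
y = [1.0 if dot(row, x_true) >= 0 else -1.0 for row in A]  # 1-bit measurements

x = [0.0] * n
loss_before = surrogate_loss(A, y, x)
for _ in range(100):
    g = surrogate_grad(A, y, x)
    x = [xi - 0.5 * gi for xi, gi in zip(x, g)]  # data-fidelity step only
loss_after = surrogate_loss(A, y, x)
print(loss_before, loss_after)
```

In a plug-and-play scheme of the kind the abstract describes, a step like this would alternate with a denoising step from a pretrained diffusion model; here the loop only shows that the surrogate makes gradient-based iterations possible.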

Adversarial perturbations (APs) have become a major concern in image classification tasks. The most challenging variant, universal adversarial perturbations (UAPs), is designed to fool most unseen samples. Such one-to-all perturbations have the merit of transferability, which has strong practical significance. In this paper, we first define the transferability gap and the algorithm stability of the UAP algorithm, and prove the relationship between them. In analyzing the UAP algorithm stability, we prove that the convergence domain of existing UAP algorithms with dynamic constraints is excessively small, which degrades the capacity of UAPs. We therefore propose a new expected constraint and prove that UAPs under the expected constraint suit any sample with high probability. In addition, we propose a Stochastic Universal Adversarial Perturbation (SUAP) that involves additive noise and the expected constraint. Finally, by treating the proposed algorithm as a stochastic differential equation, we prove an upper bound on the UAP algorithm stability of SUAP, which decreases exponentially at the beginning and then increases at a sublinear rate to at most a fixed constant. Experimental results show that SUAP outperforms existing UAP algorithms with better white-box transferability.

Stochastic Universal Adversarial Perturbations with Fixed Optimization Constraint and Ensured High-probability Transferability

Missing data presents a widespread challenge in real-world data collection. In this paper, our goal is to impute missing entries while accurately reflecting the uncertainty associated with them. We introduce U-VAE, a method that employs a non-parametric distributional learning strategy to parameterize the likelihood of missing values. To address the infeasibility of directly estimating the underlying conditional distributions due to data incompleteness, we incorporate stochastic re-masking and un-masking techniques during training. Specifically, we replace the conventional reconstruction loss with the continuous ranked probability score (CRPS), a strictly proper scoring rule, and theoretically demonstrate that the discrepancy between the underlying conditional distribution and our imputer is upper-bounded. We evaluate the performance of U-VAE on 11 real-world datasets, showing its effectiveness in both single and multiple imputations, while also enhancing post-imputation performance and supporting valid statistical inference.

Impute Missing Entries with Uncertainty
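The continuous ranked probability score mentioned in the U-VAE abstract has a standard sample-based estimator, CRPS(F, y) ≈ mean|x_i − y| − ½ · mean|x_i − x_j|, where the x_i are draws from the predictive distribution F. The sketch below implements that generic estimator for a single scalar entry; the function name `crps_from_samples` and the toy values are assumptions, and it does not reproduce U-VAE's training loss or its re-masking procedure.

```python
def crps_from_samples(samples, observed):
    """Sample-based CRPS estimate for one scalar observation (lower is better)."""
    m = len(samples)
    # First term: average distance from the predictive draws to the observation.
    accuracy = sum(abs(s - observed) for s in samples) / m
    # Second term: average pairwise distance between draws (rewards calibrated spread).
    spread = sum(abs(a - b) for a in samples for b in samples) / (m * m)
    return accuracy - 0.5 * spread


# A degenerate predictive distribution reduces to the absolute error.
point_score = crps_from_samples([1.0, 1.0, 1.0], 3.0)
# Two draws straddling the observation are penalized less than their mean distance.
spread_score = crps_from_samples([0.0, 2.0], 1.0)
print(point_score, spread_score)
```

Unlike a squared reconstruction error, this score depends on the whole set of draws, so an imputer is rewarded for representing uncertainty about a missing entry rather than for collapsing to a single value.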

Large Language Models (LLMs) perform well in fake news detection tasks, but their outputs are often accompanied by hallucinations, i.e., generated content that is contradictory or deviates from facts. Previous studies have mostly mitigated hallucinations through prompt design. However, this paper reveals that the regions in news articles that easily induce hallucination in LLMs closely correspond to the regions that challenge fake news detectors. Based on this finding, we propose a fake news detection framework (PHPFND) based on post-hoc processing of LLM hallucinations. Specifically, our framework includes a hallucination detection module (ISHD) based on information structuring that detects three types of LLM hallucinations in a targeted manner, and a hallucination-driven feature enhancement mechanism (HDFE) that incorporates hallucination signals as explicit features into sentence-level encoding and feature fusion to guide the model's attention toward high-risk regions.
Experimental results on two mainstream fake news datasets show that our proposed method significantly outperforms mainstream LLM-based baselines.

PHPFND: Detecting Fake News via Post-Hoc Processing of LLMs Hallucination

Recently, Automatic Speech Recognition (ASR) systems (e.g., Whisper) have achieved remarkable accuracy improvements but remain highly sensitive to real-world unseen data (data with large distribution shifts), including noisy environments and diverse accents. To address this issue, test-time adaptation (TTA) has shown great potential in improving the model adaptability at inference time without ground-truth labels, and existing TTA methods often rely on pseudo-labeling or entropy minimization. However, by treating model confidence as a learning signal, these methods may reinforce high-confidence errors, leading to confirmation bias that undermines adaptation.
To overcome these limitations, we present ASR-TRA, a novel Test-time Reinforcement Adaptation framework inspired by causal intervention. More precisely, our method introduces a learnable decoder prompt and utilizes temperature-controlled stochastic decoding to generate diverse transcription candidates. These are scored by a reward model that measures audio-text semantic alignment, and the resulting feedback is used to update both model and prompt parameters via reinforcement learning.
Comprehensive experiments on LibriSpeech with synthetic noise and L2 Arctic accented English datasets demonstrate that our method significantly outperforms existing state-of-the-art (SOTA) methods, including SUTA and SGEM, in both accuracy and inference speed. Ablation studies further confirm the effectiveness of combining audio and language-based rewards, highlighting our method's enhanced stability and interpretability. Overall, our approach provides a practical and robust solution for deploying ASR systems in challenging real-world conditions.

Boosting ASR Robustness via Test-Time Reinforcement Learning with Audio-Text Semantic Rewards
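The temperature-controlled stochastic decoding mentioned in the ASR-TRA abstract builds on a simple mechanism: dividing logits by a temperature before the softmax, so that higher temperatures flatten the distribution and yield more diverse samples. The sketch below shows only that mechanism on toy logits; the function names and values are assumptions, and the learnable decoder prompt, reward model, and reinforcement learning update are not modeled.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature; temperature must be positive."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [l / temperature for l in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_candidates(logits, temperature, k, rng):
    """Draw k token indices from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=k)


toy_logits = [2.0, 1.0, 0.1]  # assumed scores for three candidate tokens
sharp = softmax_with_temperature(toy_logits, 0.5)  # concentrates on the top token
flat = softmax_with_temperature(toy_logits, 2.0)   # spreads mass across tokens
rng = random.Random(0)
candidates = sample_candidates(toy_logits, 2.0, k=5, rng=rng)
print(sharp, flat, candidates)
```

In a test-time adaptation loop of the kind described, each sampled candidate would then be scored by a reward signal, which is where diversity among candidates becomes useful.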

Large language models (LLMs) are increasingly deployed with hierarchical instruction schemes, where certain instructions (e.g., system-level directives) are expected to take precedence over others (e.g., user messages). Yet, we lack a systematic understanding of how effectively these hierarchical control mechanisms work. We introduce a systematic evaluation framework based on constraint prioritization to assess how well LLMs enforce instruction hierarchies. Our experiments across six state-of-the-art LLMs reveal that models struggle with consistent instruction prioritization, even for simple formatting conflicts. We find that the widely-adopted system/user prompt separation fails to establish a reliable instruction hierarchy, and models exhibit strong inherent biases toward certain constraint types regardless of their priority designation. We find that LLMs more reliably obey constraints framed through natural social hierarchies (e.g., authority, expertise, consensus) than system/user roles, which suggests that pretraining-derived social structures act as latent control priors, with potentially stronger influence than post-training guardrails.

Control Illusion: The Failure of Instruction Hierarchies in Large Language Models

Recent advances in vision-language models (VLMs) have shed light on human-level embodied intelligence. However, existing benchmarks for VLM-driven embodied agents still rely on pre-defined high-level commands or discretized action spaces, "non-native" settings that diverge markedly from the real world. Moreover, current benchmarks focus exclusively on high-level tasks and lack joint evaluation and analysis at both the low and high levels. To bridge these gaps, we present NativeEmbodied, a challenging benchmark for VLM-driven embodied agents that adopts a unified, native low-level action space. Built upon diverse simulated scenes, NativeEmbodied first designs three representative high-level tasks in complex scenarios to evaluate overall performance. For more detailed and comprehensive performance analysis, we further decouple the entangled skills behind complex tasks and construct four types of low-level tasks, each corresponding to a key fundamental embodied skill. This joint evaluation across task and skill granularities enables a fine-grained assessment of embodied agents. Comprehensive experiments on the best VLMs reveal pronounced deficiencies in certain fundamental embodied skills. Further analysis shows that these low-level bottlenecks severely constrain performance on high-level tasks. NativeEmbodied not only pinpoints the key challenges faced by current VLM-driven embodied agents, but also provides valuable insight for future development of this field.

How Foundational Skills Influence VLM-based Embodied Agents: A Native Perspective

We study the computational problem of computing a fair means clustering of discrete vectors, which admits an equivalent formulation as editing a colored matrix into one with few distinct color-balanced rows by changing at most k values. While NP-hard in both the fairness-oblivious and the fair settings, the problem is well-known to admit a fixed-parameter algorithm in the former "vanilla" setting. As our first contribution, we exclude an analogous algorithm even for highly restricted fair means clustering instances. We then proceed to obtain a full complexity landscape of the problem, and establish tractability results which capture three means of circumventing our obtained lower bound: placing additional constraints on the problem instances, fixed-parameter approximation, or using an alternative parameterization targeting tree-like matrices.

Matrix Editing Meets Fair Clustering: Parameterized Algorithms and Complexity

Developing neural network models to estimate spatial gene expression from pathological images is important for overcoming the high observational costs associated with spatial gene expression data. In prior studies, only a small subset of highly variable genes has been used for expression estimation, despite tens of thousands of genes being observed, in order to enable evaluation that mitigates the impact of observational noise. Genes outside this subset have been excluded from the training process as well. However, since there are likely co-expression relationships between genes, low-expression genes may still contribute to the estimation of the evaluation target. In this paper, we propose Auxiliary Gene Learning (AGL), which exploits the ignored genes by reformulating their expression estimation as auxiliary tasks and training them jointly with the primary tasks. To effectively leverage auxiliary genes, we must select a subset of auxiliary genes that positively influence the prediction of the evaluation genes. However, this is a challenging optimization problem due to the vast number of possible combinations. To overcome this challenge, we propose Prior-Knowledge-Based Differentiable Top-k Gene Selection via Bi-level Optimization (DkGSB), a method that ranks genes by leveraging prior knowledge and relaxes the combinatorial selection problem into a differentiable top-k selection problem. The experiments demonstrate the effectiveness of incorporating auxiliary genes into the learning process and show that the proposed method outperforms conventional auxiliary task learning approaches.

Auxiliary Gene Learning: Spatial Gene Expression Estimation by Auxiliary Gene Selection

Large language model (LLM) training demands extensive data parallelism, resulting in massive gradient communication overhead. While gradient quantization presents a promising solution, it faces two critical challenges: maintaining training stability for transformer architectures and adapting to modern AllReduce-based distributed communication systems. In this paper, we propose BitDP, an ultra-low bit gradient quantization and data parallelism system that reduces communication costs by up to 32× while preserving model accuracy with less than 1% performance degradation. Our approach ensures numerical stability for large transformer models and seamlessly integrates with existing AllReduce infrastructures. We validate BitDP's effectiveness across various LLM sizes and architectural variants, achieving significant training efficiency improvements while maintaining convergence quality. These results establish BitDP as a scalable and reliable solution for real-world LLM training at industrial scales.
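The "up to 32×" figure is consistent with sending one bit per gradient entry instead of a 32-bit float. The abstract does not describe BitDP's actual quantizer, so the sketch below shows a generic sign-plus-scale scheme as an assumed stand-in: each entry is reduced to its sign, and a single mean-absolute-value scale is kept per tensor. The function names are hypothetical, and stability measures and AllReduce integration are not modeled.

```python
def quantize_1bit(grad):
    """Pack the sign of each gradient entry into bits, plus one shared scale."""
    scale = sum(abs(g) for g in grad) / len(grad)  # mean absolute value
    packed = bytearray((len(grad) + 7) // 8)       # 8 entries per byte
    for i, g in enumerate(grad):
        if g >= 0:
            packed[i // 8] |= 1 << (i % 8)         # bit set means non-negative
    return scale, bytes(packed)


def dequantize_1bit(scale, packed, n):
    """Rebuild an n-entry gradient whose entries are +scale or -scale."""
    return [scale if (packed[i // 8] >> (i % 8)) & 1 else -scale for i in range(n)]


grad = [0.5, -1.5, 2.0, -1.0]  # toy gradient values
scale, packed = quantize_1bit(grad)
restored = dequantize_1bit(scale, packed, len(grad))
print(scale, packed, restored)
```

With float32 inputs, the payload shrinks from 32 bits to 1 bit per entry plus one scale per tensor, which approaches a 32× reduction for large tensors. The reconstruction keeps only sign and average magnitude, which is why training stability is a central concern for schemes of this kind.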
