Singapore

Text-to-SQL is a fundamental yet challenging task in NLP that aims to translate natural language questions into SQL queries. While recent advances in large language models have greatly improved performance, most existing approaches depend on models with tens of billions of parameters or on costly APIs, limiting their applicability in resource-constrained environments. In real-world deployments, especially on edge devices, Text-to-SQL must be cost-effective. Enabling light-weight models for Text-to-SQL is therefore of great practical significance. However, smaller LLMs often struggle with complicated user instructions, redundant schema linking, and syntax correctness. To address these challenges, we propose MCTS-SQL, a novel framework that uses Monte Carlo Tree Search to guide SQL generation through multi-step refinement. Because light-weight models perform poorly in single-shot prediction, we generate better results through several trials with feedback. However, directly applying MCTS-based methods inevitably leads to significant time and computational overhead. To address this issue, we propose a token-level prefix-cache mechanism that stores prior information across iterations, effectively improving execution speed. Experimental results on the SPIDER and BIRD benchmarks demonstrate the effectiveness of our approach. Using the small open-source Qwen2.5-Coder-1.5B model, our method outperforms ChatGPT-3.5. When leveraging a more powerful model, Gemini 2.5, to explore the performance upper bound, we achieve results competitive with the SOTA. Our findings demonstrate that even small models can be effectively deployed in practical Text-to-SQL systems with the right strategy.
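The abstract does not spell out the search procedure, so the following is only a minimal sketch of how tree search with execution feedback could refine a SQL candidate. Everything in it is an assumption made for illustration: the `Node` class, the UCT constant `c=1.4`, the three-level `execution_reward`, and especially `toy_propose`, a rule-based stand-in for the LLM refinement step. It does not model the prefix-cache mechanism and should not be read as the authors' implementation.

```python
import math
import sqlite3


class Node:
    """One candidate SQL string in the search tree."""

    def __init__(self, sql, parent=None):
        self.sql = sql
        self.parent = parent
        self.children = []
        self.visits = 0
        self.total_reward = 0.0


def uct(node, c=1.4):
    """UCB1 applied to trees: mean reward plus an exploration bonus."""
    if node.visits == 0:
        return float("inf")  # unvisited children are tried first
    mean = node.total_reward / node.visits
    return mean + c * math.sqrt(math.log(node.parent.visits) / node.visits)


def execution_reward(conn, sql):
    """Assumed feedback: 0.0 on error, 0.5 if no rows, 1.0 if rows come back."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return 0.0
    return 1.0 if rows else 0.5


def mcts_refine(conn, root_sql, propose, iterations=20):
    root = Node(root_sql)
    best_sql, best_reward = root_sql, -1.0
    for _ in range(iterations):
        # Selection: descend by UCT until reaching a leaf.
        node = root
        while node.children:
            node = max(node.children, key=uct)
        # Expansion: only expand a leaf that has already been evaluated.
        if node.visits > 0:
            for cand in propose(node.sql):
                if cand != node.sql:
                    node.children.append(Node(cand, parent=node))
            if node.children:
                node = node.children[0]
        # Evaluation: run the candidate against the database.
        reward = execution_reward(conn, node.sql)
        if reward > best_reward:
            best_sql, best_reward = node.sql, reward
        # Backpropagation: update statistics up to the root.
        while node is not None:
            node.visits += 1
            node.total_reward += reward
            node = node.parent
        if best_reward >= 1.0:
            break
    return best_sql, best_reward


def toy_propose(sql):
    """Hypothetical stand-in for an LLM that proposes refined queries."""
    return [sql.replace("nme", "name"), sql.replace("userz", "users")]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.execute("INSERT INTO users VALUES ('Ada')")
best_sql, best_reward = mcts_refine(conn, "SELECT nme FROM userz", toy_propose)
print(best_sql, best_reward)
```

In this toy run the initial query has two errors, each proposal fixes only one of them, and the search has to go two levels deep before a candidate executes successfully. That is the kind of multi-step repair a single-shot prediction cannot perform.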

AAAI 2026

MCTS-SQL: Light-Weight LLMs Can Master the Text-to-SQL Through Monte Carlo Tree Search

monte carlo search

nl2sql

agent


poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held at Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.



Using risky text prompts, such as pornographic and violent prompts, to test the safety of text-to-image (T2I) models is a critical task. However, existing risky prompt datasets are limited in three key areas: 1) limited risky categories, 2) coarse-grained annotation, and 3) low effectiveness. To address these limitations, we introduce T2I-RiskyPrompt, a comprehensive benchmark designed for evaluating safety-related tasks in T2I models.
Specifically, we first develop a hierarchical risk taxonomy, which consists of 6 primary categories and 14 fine-grained subcategories. Building upon this taxonomy, we construct a pipeline to collect and annotate risky prompts. Finally, we obtain 6,432 effective risky prompts, where each prompt is annotated with both hierarchical category labels and detailed risk reasons. 
Moreover, to facilitate the evaluation, we propose a reason-driven risky image detection method that explicitly aligns the MLLM with safety annotations.
Based on T2I-RiskyPrompt, we conduct a comprehensive evaluation of eight T2I models, nine defense methods, five safety filters, and five attack strategies, offering nine key insights into the strengths and limitations of T2I model safety.
Finally, we discuss potential applications of T2I-RiskyPrompt across various research fields.
The dataset and code are provided in the supplementary material.

T2I-RiskyPrompt: A Benchmark for Safety Evaluation, Attack, and Defense on Text-to-Image Model

We introduce OceanSplat, a novel method that captures geometric structures in scattering media with high fidelity for real-time 3D underwater scene representation. To overcome the severe attenuation and scattering effects inherent in underwater environments, our method imposes trinocular stereo consistency on views translated along two orthogonal axes, effectively constraining the spatial placement of 3D Gaussians and preserving object geometry under complex medium conditions. We further apply self-supervised geometric regularization using a synthetic epipolar depth prior derived from these translated views to suppress medium-induced misplacement of 3D Gaussians. In addition, we align the rendered depth with the $z$-component of individual 3D Gaussians to suppress floaters and enhance structural fidelity. Furthermore, we propose a depth-aware alpha adjustment module that uses directional and depth cues to guide visibility learning in the early stages of training, preventing erroneous placements and medium entanglement. Our method effectively represents 3D scenes under scattering media without external geometric cues by preventing foreground 3D Gaussians from erroneously contributing to the medium in novel views, thereby preserving overall scene quality. Experiments on real-world underwater and simulated scenes demonstrate that our method outperforms prior approaches in 3D scene representation under scattering media.

OceanSplat: Object-aware Gaussian Splatting with Trinocular View Consistency for Underwater Scene Reconstruction

Deep learning-based image manipulation localization (IML) methods have achieved remarkable performance in recent years, but typically rely on large-scale pixel-level annotated datasets. To address the challenge of acquiring high-quality annotations, some recent weakly supervised methods utilize image-level labels to segment manipulated regions. However, the performance is still limited due to insufficient supervision signals. In this study, we explore a form of weak supervision that improves the annotation efficiency and detection performance, namely scribble annotation supervision. We re-annotate mainstream IML datasets with scribble labels and propose the first scribble-based IML (Sc-IML) dataset. Additionally, we propose the first scribble-based weakly supervised IML framework. Specifically, we employ self-supervised training with a structural consistency loss to encourage the model to produce consistent predictions under multi-scale and augmented inputs. In addition, we propose a prior-aware feature modulation module (PFMM) that adaptively integrates prior information from both manipulated and authentic regions for dynamic feature adjustment, further enhancing feature discriminability and prediction consistency in complex scenes. We also propose a gated adaptive fusion module (GAFM) that utilizes gating mechanisms to regulate information flow during feature fusion, guiding the model toward emphasizing potential tampered regions. Finally, we propose a confidence-aware entropy minimization loss ($\mathcal{L}_{\mathrm{CEM}}$). This loss dynamically regularizes predictions in weakly annotated or unlabeled regions based on model uncertainty, effectively suppressing unreliable predictions. Experimental results show that our method outperforms existing fully supervised approaches in terms of average performance both in-distribution and out-of-distribution. Our Sc-IML dataset and code will be released upon acceptance.

Beyond Fully Supervised Pixel Annotations: Scribble-Driven Weakly-Supervised Framework for Image Manipulation Localization

Pansharpening is a powerful technique for generating high-resolution multispectral (HRMS) images by fusing currently available image pairs of low-resolution multispectral (LRMS) and texture-rich panchromatic (PAN) data, effectively addressing the physical constraints of satellite sensors. While recent generative diffusion models have demonstrated impressive performance gains in this domain, their prohibitive computational demands and training costs hinder practicality in resource-constrained remote sensing satellite systems. In this work, we propose NODiff, a novel diffusion framework that replaces the conventional attention-based denoising backbone with a neural operator, seamlessly integrating operator learning and generative modeling into an efficient yet effective solution for pansharpening.
In practice, we implement our approach through a two-stage learning paradigm: First, we pretrain the proposed Neural Operator-based diffusion model to learn the high-resolution texture priors essential for pansharpening. Afterward, we freeze the pretrained parameters, and design a lightweight conditional detail guidance adapter to enable efficient fine-tuning for generating desired HRMS images. Meanwhile, a time-aware low-rank adaptation is introduced to dynamically refine high-frequency details potentially affected by spectral mode truncation. Extensive experiments on multiple benchmark datasets demonstrate that NODiff achieves competitive pansharpening performance while significantly reducing training and inference costs. Beyond pansharpening, our method provides new insights into building resource-efficient generative models.

NODiff: Neural Operator Diffusion for Multispectral Image Fusion

Active domain adaptation (ADA) aims to select a small set of target samples for annotation and use them for training to maximally boost the adaptation performance. However, most existing ADA methods rely only on the original output of the model, without considering the relationship between the source and target domain features, which may lead to selecting uninformative samples. In this paper, we propose an effective ADA framework: Prototype-Driven Active Domain Adaptation with density consideration (PDADA). It selects the most valuable target samples in the presence of domain shift through two criteria: Density-Conscious Domainness (DCD) and Prototype-Driven Informativeness (PDI). Furthermore, considering the class imbalance and cluster looseness issues in sample selection and domain adaptation, we develop a Class Balanced Expansion (CBE) algorithm and the Adversarial Active Domain Adaptation via Protecting Structured Information (AADA-PSI). Extensive experiments demonstrate that, with these components working together, PDADA outperforms previous methods on several challenging benchmarks and can be generalized to the multi-source active domain adaptation setting.

Prototype-Driven Active Domain Adaptation with Density Consideration

Although previous deep imputation methods (e.g., Generative Adversarial Network (GAN) based methods) have been widely designed to impute missing values, they still lack both imputation diversity and generalization ability. In this paper, we propose a new GAN-based imputation method, namely Meta-GAIN, which investigates a new generator for achieving diverse imputation and generalization ability. Specifically, we employ the Kullback-Leibler (KL) divergence to achieve the diversity of imputed data by generating a continuous embedding space of the original data. We also design a task regularizer (i.e., a cross entropy between the predicted results and the true labels) to push samples within the same class close together and samples in different classes far apart to achieve generalization ability. Moreover, we theoretically prove that our proposed method achieves the generalization ability. In addition, we design a new meta network to efficiently optimize our objective function. Experimental results on real datasets show that our proposed method outperforms all comparison methods under different missing mechanisms in terms of imputation performance and classification tasks.

Meta-GAIN for Missing Data Imputation

Neural representations (NRs), such as neural fields and 3D Gaussians, effectively model volumetric data in computed tomography (CT) but suffer from severe artifacts under sparse-view settings. To address this, we propose DiffNR, a novel framework that enhances NR optimization with diffusion priors. At its core is SliceFixer, a single-step diffusion model designed to correct artifacts in degraded slices. We integrate specialized conditioning layers into the network and develop tailored data curation strategies to support model finetuning. During reconstruction, SliceFixer periodically generates pseudo-reference volumes, providing auxiliary 3D perceptual supervision to fix underconstrained regions. Compared to prior methods that embed CT solvers into time-consuming iterative denoising, our repair-and-augment strategy avoids frequent diffusion model queries, leading to better runtime performance. Extensive experiments show that DiffNR improves PSNR by 3.99 dB on average, generalizes well across domains, and maintains efficient optimization.

DiffNR: Diffusion-Enhanced Neural Representation Optimization for Sparse-View 3D Tomographic Reconstruction

Recent advances in audio-driven talking-head synthesis have brought lip-sync precision close to human perception, yet emotional fidelity and real-time inference remain open challenges. Existing pipelines typically disentangle lip articulation, facial expression, and head pose in latent space; this rigid factorization ignores the intrinsic coupling between articulation and affect (e.g., downward lip corners when sad), thus limiting expressiveness.
We cast speech-conditioned facial motion as a sample from an emotion-conditioned distribution in a motion latent space. Concretely, we (i) learn a motion dictionary of orthogonal bases with an autoencoder via self-supervision, (ii) construct emotion-conditioned sub-spaces within the latent space, and (iii) design a layer-progressive cross-attention fusion module that modulates a flow-matching sampler with both audio and emotion signals. Only ten reverse ODE steps are required to generate a motion-latent trajectory, enabling real-time end-to-end latency.
Extensive experiments on MEAD and RAVDESS show that our method outperforms recent GAN- and diffusion-based baselines in emotion accuracy while running at around 75 FPS on a single desktop GPU. The proposed framework delivers the first emotionally expressive Audio2Face system that simultaneously achieves lip-sync accuracy, affective realism, and real-time performance.

Emotion-Conditioned Motion Sub-spaces with Flow Matching for Real-Time Audio-Driven Talking Heads

We present 360Explorer, a novel approach for generating 4D controllable panoramic videos conditioned on user-provided 3D instructions for exploring and manipulating dynamic worlds.
Whereas existing perspective-based methods struggle to maintain spatial consistency during in-place camera rotation, we introduce the panoramic view into controllable video generation models to inherently maintain view recall consistency.
By introducing dynamic point clouds as the 4D scene representations, 360Explorer unifies the modeling of camera transformations and object movements as incomplete renders to describe precise control instructions in 3D worlds.
To tackle the data limitation in acquiring multi-viewpoint panoramic videos, we further propose a reverse warping strategy to construct the training dataset on easily accessible monocular panoramic videos.
Extensive experiments demonstrate that 360Explorer achieves superior performance in creating 4D controllable panoramic videos with camera transformation and object movements aligned with diverse provided instructions.

360Explorer: Exploring 4D Controllable World in Panoramic Videos

Recent advances in differentiable structure learning have framed the combinatorial problem of learning directed acyclic graphs as a continuous optimization problem. Various aspects, including data standardization, have been studied to identify factors that influence the empirical performance of these methods. In this work, we investigate critical limitations in differentiable structure learning methods, focusing on settings where the true structure can be identified up to Markov equivalence classes, particularly in the linear Gaussian case. While recent work highlighted potential non-convexity issues in this setting, we demonstrate and explain why the use of $\ell_1$-penalized likelihood in such cases is fundamentally inconsistent, even if the global optimum of the optimization problem can be found. To resolve this limitation, we develop a hybrid differentiable structure learning method based on $\ell_0$-penalized likelihood with hard acyclicity constraint, where the $\ell_0$ penalty can be approximated by different techniques including Gumbel-Softmax. Specifically, we first estimate the underlying moral graph, and use it to restrict the search space of the optimization problem, which helps alleviate the non-convexity issue. Experimental results show that the proposed method enhances empirical performance both before and after data standardization, providing a more reliable path for future advancements in differentiable structure learning, especially for learning Markov equivalence classes.
