We introduce DSCodeBench, a new benchmark designed to evaluate large language models (LLMs) on complex and realistic data science code generation tasks. DSCodeBench consists of 1,000 carefully constructed problems drawn from real-world GitHub code across ten widely used Python data science libraries. Compared with existing benchmarks, DSCodeBench offers a more challenging and representative testbed: more complex code solutions, more comprehensive coverage of data science libraries, clearer and better-structured problem descriptions, and stronger test suites. To construct DSCodeBench, we develop a robust pipeline that combines task scope selection, code construction, test case generation, and problem description synthesis; the process is paired with rigorous manual editing to ensure alignment and enhance the reliability of the evaluation. Experimental results show that DSCodeBench exhibits robust scaling behavior, with larger models systematically outperforming smaller ones, validating its ability to distinguish model capabilities. The best-performing LLM we test, GPT-4o, achieves a pass@1 of only 0.392, indicating that LLMs still have substantial room for improvement on realistic data science code generation tasks. We believe DSCodeBench will serve as a rigorous and trustworthy foundation for advancing LLM-based data science programming.
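For readers unfamiliar with the reported metric, the sketch below shows the standard unbiased pass@k estimator from the HumanEval/Codex evaluation methodology, which is the usual way such scores are computed; the abstract does not spell out DSCodeBench's exact evaluation code, so treat this as an assumed illustration, and the per-problem outcomes in the example are purely hypothetical.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021):
    probability that at least one of k samples drawn from n
    generations (of which c pass the test suite) is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With a single sample per problem (n = k = 1), pass@1 reduces to the
# fraction of problems whose generated solution passes all tests.
outcomes = [1, 0, 0, 1, 0]  # hypothetical per-problem correctness
score = sum(pass_at_k(1, c, 1) for c in outcomes) / len(outcomes)
print(score)  # 0.4
```

Under this reading, GPT-4o's reported pass@1 of 0.392 means roughly 39% of DSCodeBench problems are solved by its first generated attempt.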