Singapore

Beyond scratch coding, exploiting large-scale code repositories (e.g., GitHub) for practical tasks is vital in real-world software development, yet current benchmarks rarely evaluate code agents in such authentic, workflow-driven scenarios. 
To bridge this gap, we introduce GitTaskBench, a benchmark designed to systematically assess this capability via 54 realistic tasks across 7 modalities and 7 domains. 
Each task pairs a relevant repository with an automated, human-curated evaluation harness specifying practical success criteria. Beyond measuring execution and task success, we also propose the alpha-value metric to quantify the economic benefit of agent performance, which integrates task success rates, token cost, and average developer salaries. Experiments across three state-of-the-art agent frameworks with multiple advanced LLMs show that leveraging code repositories for complex task solving remains challenging: even the best-performing system, OpenHands+Claude 3.7, solves only 48.15% of tasks (recent progress has pushed the frontier further, with RepoMaster+Claude 3.5 achieving a new record of 62.96%). Error analysis attributes over half of failures to seemingly mundane yet critical steps like environment setup and dependency resolution, highlighting the need for more robust workflow management and increased timeout preparedness.
By releasing GitTaskBench, we aim to drive progress and attention toward repository-aware code reasoning, execution, and deployment---moving agents closer to solving complex, end-to-end real-world tasks. The code &amp; datasets are public in https://github.com/QuantaAlpha/GitTaskBench. See the extended version at https://arxiv.org/abs/2508.18993.

AAAI 2026

GitTaskBench: A Benchmark for Code Agents Solving Real-World Tasks Through Code Repository Leveraging

nlp: code generation / program synthesis from natural language

nlp: (large) language models

and evaluation of nlp models

nlp: interpretability

nlp: applications

analysis

Beyond scratch coding, exploiting large-scale code repositories (e.g., GitHub) for practical tasks is vital in real-world software development, yet current benchmarks rarely evaluate code agents in such authentic, workflow-driven scenarios. 
To bridge this gap, we introduce GitTaskBench, a benchmark designed to systematically assess this capability via 54 realistic tasks across 7 modalities and 7 domains. 
Each task pairs a relevant repository with an automated, human-curated evaluation harness specifying practical success criteria. Beyond measuring execution and task success, we also propose the alpha-value metric to quantify the economic benefit of agent performance, which integrates task success rates, token cost, and average developer salaries. Experiments across three state-of-the-art agent frameworks with multiple advanced LLMs show that leveraging code repositories for complex task solving remains challenging: even the best-performing system, OpenHands+Claude 3.7, solves only 48.15% of tasks (recent progress has pushed the frontier further, with RepoMaster+Claude 3.5 achieving a new record of 62.96%). Error analysis attributes over half of failures to seemingly mundane yet critical steps like environment setup and dependency resolution, highlighting the need for more robust workflow management and increased timeout preparedness.
By releasing GitTaskBench, we aim to drive progress and attention toward repository-aware code reasoning, execution, and deployment---moving agents closer to solving complex, end-to-end real-world tasks. The code & datasets are public in https://github.com/QuantaAlpha/GitTaskBench. See the extended version at https://arxiv.org/abs/2508.18993.

technical paper

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.<br><br>

To access this event page, you need to log in with the **email address you registered with**. <br>Access credentials will be sent to your email from Underline -  subject line "Welcome to AAAI 2026". Please be sure to check your spam email folder if you do not see an email confirmation right away.

Please log in

To access this event page, you are required to register.
Please complete your registration to continue.

We recommend reading [**the registration information**](https://aaai.org/conference/aaai/aaai-26/registration/) first.

**Online Registration Form**: https://aaai.getregistered.net/conference-2026 

Registration Required

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The core of efficient on-demand ride-pooling lies in solving the Ride-Pool Matching Problem (RMP), which involves assigning multiple customer requests to single vehicles under various service constraints (e.g., pickup windows, detour allowances, and vehicle occupancy). A significant missed opportunity in most current RMP approaches is the assumption that passengers must be picked up and dropped off exactly at their requested locations. Allowing passengers to walk even short distances to meet vehicles could unlock substantial improvements in ride-pooling operations.

Building upon the limitations of existing Ride-Pool Matching Problem (RMP) solutions that neglect passenger walkability, this paper introduces a novel matching method that strategically incorporates flexible pickup and drop-off locations. Our approach simultaneously determines the optimal assignment of vehicles to requests (one vehicle to potentially multiple requests and each request to at most one vehicle), identifies advantageous meeting points for passengers, and plans efficient vehicle routes. This comprehensive optimization respects all service constraints and considers the long-term implications of routing decisions. To achieve this integrated solution, we first employ a tree-based approach to enumerate all feasible pairings between passengers and vehicles. Subsequently, we calculate an optimal route for each of these feasible matches. Finally, we evaluate the quality of all possible assignments and select the most advantageous matching for implementation.

In our experimental evaluation on city-scale taxi datasets, we demonstrate that our method improves the number of served requests by up to 13\% and reduces the average vehicle travel distance by up to 21\%. By serving more passengers with less driving distance, our approach achieves greater efficiency in a more sustainable manner — using fewer resources to deliver better service and creating a win-win outcome for all stakeholders, including customers, drivers, the aggregator, and the environment.

Optimizing Ride-Pooling Operations with Extended Pickup and Drop-Off Flexibility

The core problem in multi-view anomaly detection is to represent local neighborhoods of normal instances consistently across all views. Recent approaches consider a representation of local neighborhood in each view independently, and then capture the consistent neighbors across all views via a learning process. They suffer from two key issues. First, there is no guarantee that they can capture consistent neighbors well, especially when the same neighbors are in regions of varied densities in different views, resulting in inferior detection accuracy. Second, the learning process has a high computational cost of $\mathcal{O}(N^2)$, rendering them inapplicable for large datasets. To address these issues, we propose a novel method termed \textbf{S}pherical \textbf{C}onsistent \textbf{N}eighborhoods \textbf{E}nsemble (SCoNE). It has two unique features: (a) the consistent neighborhoods are represented with multi-view instances directly, requiring no intermediate representations as used in existing approaches; and (b) the neighborhoods have data-dependent properties, which lead to large neighborhoods in sparse regions and small neighborhoods in dense regions. The data-dependent properties enable local neighborhoods in different views to be represented well as consistent neighborhoods, without learning. This leads to $\mathcal{O}(N)$ time complexity. Empirical evaluations show that SCoNE has superior detection accuracy and runs orders-of-magnitude faster in large datasets than existing approaches.

SCoNE: Spherical Consistent Neighborhoods Ensemble for Effective and Efficient Multi-View Anomaly Detection

We study strategic candidate nomination by parties in elections decided by Plurality voting. Each party selects a nominee before the election, and the winner is chosen from the nominated candidates based on the voters' preferences. We introduce a new restriction on these preferences, which we call party-aligned single-peakedness: all voters agree on a common ordering of the parties along an ideological axis, but may differ in their perceptions of the positions of individual candidates within each party. The preferences of each voter are single-peaked with respect to their own axis over the candidates, which is consistent with the global ordering of the parties.
We present a polynomial-time algorithm for recognizing whether a preference profile satisfies party-aligned single-peakedness. 
In this domain, we give polynomial-time algorithms for deciding whether a given party can become the winner under some (or all) nominations, and whether this can occur in some pure Nash equilibrium. Moreover, we prove a tight result about the guaranteed existence of pure strategy Nash equilibria for elections with up to three parties for single-peaked and party-aligned single-peaked preference profiles.

Computing Equilibrium Nominations in Presidential Elections

Test-time adaptation (TTA) enables online model adaptation using only unlabeled test data, aiming to bridge the gap between source and target distributions. However, in multimodal scenarios, varying degrees of distribution shift across different modalities give rise to a complex coupling effect of unimodal shallow feature shift and cross-modal high-level semantic misalignment, posing a major obstacle to extending existing TTA methods to the multimodal field. To address this challenge, we propose a novel multimodal test-time adaptation (MMTTA) framework, termed as Bridging Modalities via Progressive Re-alignment (BriMPR). BriMPR, consisting of two progressively enhanced modules, tackles the coupling effect with a divide-and-conquer strategy. Specifically, we first decompose MMTTA into multiple unimodal feature alignment sub-problems. By leveraging the strong function approximation ability of prompt tuning, we calibrate the unimodal global feature distributions to their respective source distributions, so as to achieve the initial semantic re-alignment across modalities. Subsequently, we assign the credible pseudo-labels to combinations of masked and complete modalities, and introduce inter-modal instance-wise contrastive learning to further enhance the information interaction among modalities and refine the alignment. Extensive experiments on MMTTA tasks, including both corruption-based and real-world domain shift benchmarks, demonstrate the superiority of our method.

Bridging Modalities via Progressive Re-alignment for Multimodal Test-Time Adaptation

Existing battery State of Health (SOH) prediction approaches often struggle to provide both accurate predictions and reliable uncertainty estimates. This paper presents a novel Multi-Task Learning (MTL) framework that jointly tackles SOH prediction and provides a proxy metric for uncertainty through a unified architecture. The framework combines a Physics-Informed Neural Network (PINN) for SOH prediction with a deep autoencoding Gaussian mixture model for uncertainty modeling. Particularly, the energy score from the Gaussian mixture model serves as a proxy metric for uncertainty, where a higher score indicates potential prediction unreliability. Moreover, to enhance task-specific learning, we employ a multi-head attention mechanism that adaptively captures distinct feature relationships. 
Our experiments show improvements in prediction performance compared to the state-of-the-art baseline.
A comprehensive evaluation on six XJTU battery benchmark datasets demonstrates that our framework achieves a prediction accuracy of 99.50% (MAPE: 0.0050) while providing reliable uncertainty quantification through the proxy metric.

Physics-Informed Multi-Task Learning for Battery State of Health Prediction with Uncertainty Quantification

High-resolution Earth Observation technologies present unprecedented opportunities for geospatial analysis, yet traditional 2D aerial-view semantic segmentation remains limited by its inability to model spatial relationships and handle object occlusions. While 3D Aerial-view Segmentation (3DAS) has emerged to address these limitations, existing methods predominantly rely on 2D discriminative models pre-trained on natural scenes. These models struggle to accurately recognize aerial-view imagery, resulting in suboptimal performance due to significant domain discrepancies. This paper introduces ID-Splat, a novel object-centric framework that directly leverages multi-view object identities without discriminative information to enhance 3D semantic understanding. ID-Splat implements a two-stage process: first, Mask-object Tracking combines SAM and Point Tracking to establish robust and consistent object identities across multi-view aerial images; second, Object Integration & Propagation assigns these identities to 3D Gaussian Splatting (3DGS) points, enabling complete 3D segmentation through semantic propagation. Experimental results on the 3D-AS dataset demonstrate that ID-Splat significantly outperforms existing methods, particularly under sparse supervision conditions. ID-Splat also achieves state-of-the-art performance while reducing the need for extensive labeled data by effectively leveraging the inherent 3D structure.

ID-Splat: Propagating Object Identities for Segmenting 3D Aerial-view Scenes

Continual Learning with Pre-trained Models holds great promise for efficient adaptation across sequential tasks. However, most existing approaches freeze PTMs and rely on auxiliary modules like prompts or adapters, limiting model plasticity and leading to suboptimal generalization when facing significant distribution shifts. While full fine-tuning can improve adaptability, it risks disrupting crucial pre-trained knowledge. In this paper, we propose Mutual Information-guided Sparse Tuning (MIST), a plug-and-play method that selectively updates a small subset of PTM parameters, less than 5%, based on sensitivity to mutual information objectives. MIST enables effective task-specific adaptation while preserving generalization. To further reduce interference, we introduce strong sparsity regularization by randomly dropping gradients during tuning, resulting in fewer than 0.5% of parameters being updated per step. Applied before standard freeze-based methods, MIST consistently boosts performance across diverse continual learning benchmarks. Experiments show that integrating our method into multiple baselines yields significant performance gains. Our code is available at \url{https://github.com/zhwhu/MIST}.

Sparse Tuning Enhances Plasticity in PTM-based Continual Learning

Gaussian process regression (GPR) is a popular nonparametric Bayesian method that provides predictive uncertainty estimates and is widely used in safety-critical applications. While prior research has introduced various uncertainty bounds, most existing approaches 
require access to specific input features and rely on posterior mean and variance estimates or tuning hyperparameters. These limitations hinder robustness and fail to capture the model’s global behavior in expectation.
To address these limitations, we propose a chaining-based framework for estimating upper and lower bounds on the expected extreme values over unseen data, without requiring access to specific input locations. We provide kernel-specific refinements for commonly used kernels such as RBF and Matérn, in which our bounds are tighter than generic constructions. We further improve numerical tightness by avoiding analytical relaxations. 
In addition to global estimation, we also develop a novel method for local uncertainty quantification at specified inputs. This approach leverages chaining geometry through partition diameters, adapting to local structure without relying on posterior variance scaling. 
Our experimental results validate the theoretical findings and demonstrate that our method outperforms existing approaches on both synthetic and real-world datasets.

Practical Global and Local Bounds in Gaussian Process Regression via Chaining

This paper introduces **STAR-1**, a high-quality, just-1k-scale *safety* dataset specifically designed for large reasoning models (LRMs) like DeepSeek-R1. Built on three core principles --- diversity, deliberative reasoning, and rigorous filtering --- STAR-1 aims to address the critical needs for safety alignment in LRMs. Specifically, we begin by integrating existing open-source safety datasets from diverse sources. Then, we curate safety policies to generate policy-grounded deliberative reasoning samples. Lastly, we apply a GPT-4o-based safety scoring system to select training examples aligned with best practices. Experimental results show that fine-tuning LRMs with STAR-1 leads to an average 40% improvement in safety performance across four benchmarks, while only incurring a marginal decrease (e.g., an average of 1.1%) in reasoning ability measured across five reasoning tasks. Extensive ablation studies further validate the importance of our design principles in constructing STAR-1 and analyze its efficacy across both LRMs and traditional LLMs.

STAR-1: Safer Alignment of Reasoning LLMs with 1K Data

We study almost envy-freeness in house allocation, where
$m$ houses are to be allocated among $n$ agents so that
every agent receives exactly one house. An envy-free
allocation need not exist, and therefore we may have to
settle for relaxations. We study different aggregate
measures of envy as markers of fairness. In particular, we
define the amount of envy experienced by an agent $a$
w.r.t. an allocation to be the number of agents that agent
$a$ envies under that allocation. ~\cite{KMS2021} studied
the problem of minimizing the number of envious agents and
showed that it is NP-Complete to find allocations that
minimize the number of envious agents, even for binary
utilities, and this quantity is hard to approximate for
general utilities. In this paper, we explore envy
minimization in house allocation from a broader perspective
and prove algorithmic results not only for minimizing the
number of envious agents but for two other measures of envy
as well---minimizing the amount of maximum envy experienced
by any agent and minimizing the amount of total envy
experienced by all the agents put together.
We prove a host of algorithmic and hardness results. We
also suggest practical approaches for these problems via
integer linear program (ILP) formulations and report the
findings of our experimental evaluation of ILPs. Finally,
we study the price of fairness, which quantifies the loss
of welfare we must suffer due to the fairness requirements,
and present tight bounds as well as algorithms that
simultaneously optimize both welfare and fairness.

Downloads

Next from AAAI 2026

Optimizing Ride-Pooling Operations with Extended Pickup and Drop-Off Flexibility

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES

.css-70qvj9{display:-webkit-box;display:-webkit-flex;display:-ms-flexbox;display:flex;-webkit-align-items:center;-webkit-box-align:center;-ms-flex-align:center;align-items:center;}Downloads

Next from AAAI 2026

Optimizing Ride-Pooling Operations with Extended Pickup and Drop-Off Flexibility

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES

Downloads