Tool-use capabilities fundamentally transform large language models (LLMs) from passive language generators into active agents with real-world utility, and have attracted intense research interest. Yet, because tool use is an emergent capability, traditional scaling laws are ineffective for early-stage prediction, obstructing principled model design and efficient training. In this work, we propose a proxy-task perspective that predicts tool-use capabilities by measuring early model performance on selected non-emergent proxy tasks. Our method quantifies two properties of each proxy task: alignment, which reflects how well it captures tool-use trajectories, and stability, which indicates how consistently it behaves across training conditions. These properties are used to weight predictive signals. Theoretically, we formalize how these weighted signals approximate emergent tool use through bounded extrapolation under relaxed assumptions. Empirically, we validate our approach across training checkpoints, model scales, and data setups. Results show that a carefully weighted ensemble of proxy tasks can accurately rank downstream tool-use ability long before it emerges. Our findings provide new theoretical foundations and practical tools for efficient training and capability planning, and advance the understanding of how complex abilities arise in LLMs.
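The weighting scheme described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names and the choice to combine alignment and stability multiplicatively are assumptions for exposition; the paper's actual quantification and weighting may differ.

```python
# Hypothetical sketch of a weighted proxy-task ensemble for ranking
# models by predicted (not yet emerged) tool-use ability.

def ensemble_prediction(proxy_scores, alignment, stability):
    """Weight each proxy-task score by its alignment * stability,
    normalize the weights, and return the weighted average.
    The multiplicative combination is an illustrative assumption."""
    weights = [a * s for a, s in zip(alignment, stability)]
    total = sum(weights)
    weights = [w / total for w in weights]
    return sum(w * p for w, p in zip(weights, proxy_scores))

def rank_models(models):
    """Rank candidate models by predicted tool-use ability, descending.
    `models` maps a model name to (proxy_scores, alignment, stability)."""
    preds = {name: ensemble_prediction(*vals) for name, vals in models.items()}
    return sorted(preds, key=preds.get, reverse=True)
```

In this sketch, a proxy task that tracks tool-use trajectories closely (high alignment) and behaves consistently across training conditions (high stability) dominates the prediction, while noisy or weakly related tasks are down-weighted.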