Singapore

Specifying informative and dense reward functions remains a pivotal challenge in Reinforcement Learning, as it directly affects the efficiency of agent training. In this work, we harness the expressive power of quantitative Linear Temporal Logic on finite traces (($\text{LTL}_f[\mathcal{F}]$)) to synthesize reward monitors that generate a dense stream of rewards for runtime-observable state trajectories. By providing nuanced feedback during training, these monitors guide agents toward optimal behaviour and help mitigate the well-known issue of sparse rewards under long-horizon decision making, which arises under the Boolean semantics dominating the current literature. Our framework is algorithm-agnostic and only relies on a state labelling function, and naturally accommodates specifying non-Markovian properties. Empirical results show that our quantitative monitors consistently subsume and, depending on the environment, outperform Boolean monitors in maximizing a quantitative measure of task completion and in reducing convergence time.

AAAI 2026

Expressive Temporal Specifications for Reward Monitoring

ml: reinforcement learning

krr: automated reasoning and theorem proving

krr: knowledge representation languages

poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.<br><br>

To access this event page, you need to log in with the **email address you registered with**. <br>Access credentials will be sent to your email from Underline -  subject line "Welcome to AAAI 2026". Please be sure to check your spam email folder if you do not see an email confirmation right away.

Please log in

To access this event page, you are required to register.
Please complete your registration to continue.

We recommend reading [**the registration information**](https://aaai.org/conference/aaai/aaai-26/registration/) first.

**Online Registration Form**: https://aaai.getregistered.net/conference-2026 

Registration Required

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

Tensor Compressive Sensing (TCS) has gained significant attention recently due to its strong ability to preserve the multidimensional structure of data. However, existing TCS methods face three critical challenges: 1) Biased approximation of tensor rank imposed by the convex surrogate Tensor Nuclear Norm (TNN) may interfere with the original low-rank structure of tensor data. 2) Vulnerability to non-Gaussian noise and outliers makes TCS methods highly susceptible to complex noise environments ubiquitous in real-world applications. 
3) Most of them are confined to third-order tensors and cannot handle high-order tensor data effectively. Being aware of these, we propose Robust Tensor Compressive Sensing (RTCS) based on M-estimators with three key innovations: 1) We design a novel M-estimator-based low-rank regularizer for order-$d$ ($d \geq 3$) tensors, which provides a superior approximation of tensor rank and better preserves the original data structure. 2) RTCS incorporates a robust Welsch estimator that adaptively mitigates the influence of complex noises and outliers in tensor recovery. 3) RTCS is developed to handle high-order tensors, thereby allowing for broader applicability beyond conventional third-order tensors. We further design an efficient algorithm based on the Alternating Direction Method of Multipliers (ADMM) to handle the complex optimization problem. Experiments show that RTCS consistently outperforms existing approaches across various noises.

Robust High-Order Tensor Compressive Sensing Based on M-Estimators

Despite the remarkable progress of Auto-Regressive (AR) image generation, its inference latency remains high due to the AR nature and the ambiguity of image tokens—even when employing Speculative Decoding (SD). Recent works have empirically addressed this issue using relaxed SD, but without theoretical grounding. In this paper, we establish the theoretical foundations of relaxed SD and propose Annealed Relaxation of Speculative Decoding (AnnealRSD), grounded in two key insights. First, by analyzing the total variation (TV) distance between the target model and relaxed SD, we derive the optimal resampling distribution that minimizes an upper bound of the TV distance. Second, perturbation analysis reveals an inherent annealing property of relaxed SD, motivating our annealed design. Together, these components enable AnnealRSD to achieve faster generation with comparable quality, or improved quality at the same latency, compared to existing methods. Extensive experiments on image generation validate the effectiveness of AnnealRSD, showing consistent improvements over prior approaches in speed and quality trade-offs.

Annealed Relaxation of Speculative Decoding for Faster Autoregressive Image Generation

The rapid advancement of generative models has increased the demand for detectors capable of generalizing across diverse and evolving generation techniques. However, existing methods—including those leveraging pre-trained vision-language models—often produce highly entangled representations, mixing task-relevant forensic cues (causal features) with spurious or irrelevant patterns (non-causal features), thus limiting generalization. To address this issue, we propose CausalCLIP, a framework that explicitly disentangles causal from non-causal features and employs targeted filtering guided by causal inference principles to retain only the most transferable and discriminative forensic cues. By modeling the generation process with a structural causal model and enforcing statistical independence through Gumbel-Softmax-based feature masking and Hilbert-Schmidt Independence Criterion (HSIC) constraints, CausalCLIP isolates stable causal features robust to distribution shifts. Extensive experiments across various generation models and datasets demonstrate that CausalCLIP significantly improves generalization, achieving gains of 4.06\% in average precision and 6.82\% in accuracy compared to existing state-of-the-art methods. The source code will be publicly available upon publication.

CausalCLIP: Causally-Informed Feature Disentanglement and Filtering for Generalizable Detection of Generated Images

Multimodal Large Language Models (MLLM) have enabled a wide range of advanced vision-language applications, including fine-grained object recognition and contextual understanding. When querying specific regions or objects in an image, human users naturally use "Visual Prompts" (VP) like bounding boxes to provide reference. However, no existing benchmark systematically evaluates the ability of MLLMs to interpret such VPs. This gap raises uncertainty about whether current MLLMs can effectively recognize VPs, an intuitive prompting method for humans, and utilize them to solve problems. To address this limitation, we introduce VP-Bench, aiming to assess MLLMs’ capability in VP perception and utilization. VP-Bench employs a two-stage evaluation framework: Stage 1 examines models’ ability to perceive VPs in natural scenes, utilizing 100K visualized prompts spanning 8 shapes and 355 attribute combinations. Stage 2 investigates the impact of VPs on downstream tasks, measuring their effectiveness in real-world problem-solving scenarios. Using VP-Bench, we evaluate 21 MLLMs, including proprietary systems (e.g., GPT-4o) and open-source models (e.g., InternVL-2.5 and Qwen2.5-VL). In addition, we conduct a comprehensive analysis of the factors influencing VP understanding, such as attribute variations and model scale. VP-Bench establishes a new reference framework for studying MLLMs’ ability to comprehend and resolve grounded referring questions.

VP-Bench: A Comprehensive Benchmark for Visual Prompting in Multimodal Large Language Models

Long-tail recognition remains challenging for pre-trained foundation models like CLIP, which often suffer from performance degradation under imbalanced data. This stems not only from the overfitting/underfitting issues during fine-tuning but, more fundamentally, from the inherent bias inherited from the long-tail distribution of their massive pre-training datasets. To address this, we propose \textbf{HGLTR} (Hierarchy-Guided Long-Tail Recognition), a novel framework that calibrates pre-trained models by injecting objective class hierarchy knowledge. We argue that the semantic proximity defined by a hierarchy provides a robust, data-independent prior to counteract model bias. Our method is specifically designed for vision-language models' dual-modality architecture. At the feature level, we align image embeddings with a hierarchy-guided text similarity structure. At the classifier level, we employ a distillation loss to regularize predictions using soft labels derived from the hierarchy. This dual-level injection effectively transfers knowledge from head to tail classes. Experiments on ImageNet-LT, Places-LT, and iNaturalist 2018 demonstrate that HGLTR achieves state-of-the-art performance, particularly in tail-class accuracy, highlighting the importance of leveraging structural priors to calibrate foundation models for real-world data.

HGLTR: Hierarchical Knowledge Injection for Calibrating Pre-trained Models in Long-Tail Recognition

Retrieval-Augmented Generation (RAG) improves large language models by retrieving external knowledge, often truncated into smaller chunks due to the input context window, which leads to information loss, resulting in response hallucinations and broken reasoning chains.
Moreover, traditional RAG retrieves unstructured knowledge, introducing irrelevant details that hinder accurate reasoning.
To address these issues, we propose TAdaRAG, a novel RAG framework for on-the-fly task-adaptive knowledge graph construction from external sources. 
Specifically, we design an intent-driven routing mechanism to a domain-specific extraction template, followed by supervised fine-tuning and a reinforcement learning-based implicit extraction mechanism, ensuring concise, coherent, and non-redundant knowledge integration. 
Evaluations on six public benchmarks and a real-world business benchmark (NowNewsQA) across three backbone models demonstrate that TAdaRAG outperforms existing methods across diverse domains and long-text tasks, highlighting its strong generalization and practical effectiveness.

TAdaRAG: Task Adaptive Retrieval-Augmented Generation via On-the-Fly Knowledge Graph Construction

Federated learning (FL) has enabled training of multilingual large language models (LLMs) on diverse and decentralized multilingual data, especially on low-resource languages. To improve client-specific performance, personalization via the use of parameter-efficient fine-tuning (PEFT) modules such as LoRA is common. This involves a personalization strategy (PS), such as the design of the PEFT adapter structures (e.g., in which layers to add LoRAs and what ranks) and choice of hyperparameters (e.g., learning rates) for fine-tuning. Instead of manual PS configuration, we propose FedP$^2$EFT, a federated learning-to-personalize method for multilingual LLMs in cross-device FL settings. Unlike most existing PEFT structure selection methods, which are prone to overfitting low-data regimes, FedP$^2$EFT collaboratively learns the optimal personalized PEFT structure for each client via Bayesian sparse rank selection. Evaluations on both simulated and real-world multilingual FL benchmarks demonstrate that FedP$^2$EFT largely outperforms existing personalized fine-tuning methods, while complementing other existing FL methods.

FedP²EFT: Federated Learning to Personalize PEFT for Multilingual LLMs

While large language models (LLMs) demonstrate emerging reasoning capabilities, current inference-time expansion methods incur prohibitive computational costs through exhaustive sampling. Through analyzing decoding trajectories, we observe that most next-token predictions align well with the golden output except a few critical tokens which leads to deviation. Inspired by this phenomenon, we propose a novel Hint-Practice Reasoning (HPR) framework that operationalizes this insight through two synergistic components: 1) a hinter (powerful LLM) that provides probabilistic guidance at critical decision points, and 2) a practitioner (efficient smaller model) that executes major reasoning steps. The framework's core innovation lies in Distributional Inconsistency Reduction (DIR), a theoretically-grounded metric that dynamically identifies intervention points by quantifying the divergence between practitioner's reasoning trajectory and hinter's expected distribution in a tree-structured probabilistic space. Through iterative tree updates guided by DIR, HPR reweights promising reasoning paths while deprioritizing low-probability branches. Experiments across arithmetic and commonsense reasoning benchmarks demonstrate HPR's state-of-the-art efficiency-accuracy tradeoffs: it achieves comparable performance to self-consistency and MCTS baselines while decoding only 1/5 tokens, and outperforms existing methods by at most 5.1% absolute accuracy while maintaining similar or lower FLOPs. Our code will be open-sourced to catalyze efficient reasoning research.

Efficient Thought Space Exploration Through Strategic Intervention

Theme detection is a fundamental task in user-centric dialogue systems, aiming to identify the latent topic of each utterance without relying on predefined schemas. Unlike intent induction, which operates within fixed label spaces, theme detection requires cross-dialogue consistency and alignment with personalized user preferences, posing significant challenges. Existing methods often struggle with sparse, short utterances for accurate topic representation and fail to capture user-level thematic preferences across dialogues. To address these challenges, we propose CATCH (Controllable Theme Detection with Contextualized Clustering and Hierarchical Generation), a unified framework that integrates three core components: (1) context-aware topic representation, which enriches utterance-level semantics using surrounding topic segments; (2) preference-guided topic clustering, which jointly models semantic proximity and personalized feedback to align themes across dialogue; and (3) a hierarchical theme generation mechanism designed to suppress noise and produce robust, coherent topic labels. Experiments on a multi-domain customer dialogue benchmark (DSTC-12) demonstrate the effectiveness of CATCH with 8B LLM in both theme clustering and topic generation quality.

CATCH: A Controllable Theme Detection Framework with Contextualized Clustering and Hierarchical Generation

Camera-based temporal 3D object detection has shown impressive results in autonomous driving, with offline models improving accuracy by using future frames. Knowledge distillation (KD) can be an appealing framework for transferring rich information from offline models to online models. However, existing KD methods overlook future frames, as they mainly focus on spatial feature distillation under strict frame alignment or on temporal relational distillation, thereby making it challenging for online models to effectively learn future knowledge. To this end, we propose a sparse query-based approach, Future Temporal Knowledge Distillation (FTKD), which effectively transfers future frame knowledge from an offline teacher model to an online student model. Specifically, we present a future-aware feature reconstruction strategy to encourage the student model to capture future features without strict frame alignment. In addition, we further introduce future-guided logit distillation to leverage the teacher's stable foreground and background context. FTKD is applied to two high-performing 3D object detection baselines, achieving up to 1.3 mAP and 1.3 NDS gains on the nuScenes dataset, as well as the most accurate velocity estimation, without increasing inference cost.

Downloads

Next from AAAI 2026

Robust High-Order Tensor Compressive Sensing Based on M-Estimators

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES

.css-70qvj9{display:-webkit-box;display:-webkit-flex;display:-ms-flexbox;display:flex;-webkit-align-items:center;-webkit-box-align:center;-ms-flex-align:center;align-items:center;}Downloads

Next from AAAI 2026

Robust High-Order Tensor Compressive Sensing Based on M-Estimators

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES

Downloads