Singapore

With the rapid integration of large language models (LLMs) into medical decision-support aids, ensuring reliability in reasoning steps, not just final answers, is increasingly critical. Two key safety dimensions are Chain-of-Thought (CoT) faithfulness, which assesses alignment of the model’s reasoning process with both its response and medical facts, and sycophancy, an emergent misalignment where models follow misleading cues at the expense of factual correctness. Yet existing benchmarks tend to prioritize performance evaluation, frequently collapsing nuanced safety vulnerabilities into a single accuracy score. To fill this gap, we introduce MedOmni-45°, a benchmark and evaluation workflow explicitly designed to quantify the safety–performance trade-off in LLMs under manipulative hint conditions. The benchmark contains 1,804 reasoning-focused medical questions across six clinical specialties and three task types, including 500 publicly comparable items from MedMCQA. Each question is systematically augmented with seven manipulative hint types, each embedding two distinct misleading cue variants, along with a No-Hint baseline, resulting in approximately 27,000 unique inputs. These inputs are then evaluated across seven LLMs spanning open- and closed-source, general-purpose and medical-specific, and base versus reasoning-enhanced variants, amounting to over 189K total inference instances. Three orthogonal metrics (Accuracy, CoT-Faithfulness, Anti-Sycophancy) are combined into a composite score visualized via a 45° safety–performance plot. Results reveal a universal trade-off, with no model surpassing the ideal diagonal. The open-source QwQ-32B comes closest at 43.81°, demonstrating notable safety without leading in performance. MedOmni-45° thus highlights critical vulnerabilities of LLMs in reasoning-oriented medical tasks, offering a robust benchmark for future alignment research.
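The input and inference counts quoted in the MedOmni-45° abstract can be checked with a few lines of arithmetic. The sketch below assumes that every question receives every hint variant plus one No-Hint baseline, and that each input is run once per model; the variable names are illustrative and do not come from the benchmark's code.

```python
# Figures taken from the abstract.
questions = 1804
hint_types = 7
variants_per_hint = 2
models = 7

# Assumption: each question gets all hint variants plus one No-Hint baseline.
conditions_per_question = hint_types * variants_per_hint + 1

# Assumption: each unique input is evaluated once by each model.
unique_inputs = questions * conditions_per_question
inference_instances = unique_inputs * models

print(conditions_per_question, unique_inputs, inference_instances)
```

Under these assumptions the totals agree with the abstract's "approximately 27,000 unique inputs" and "over 189K total inference instances".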

AAAI 2026

MedOmni-45°: A Safety–Performance Benchmark for Reasoning-Oriented LLMs in Medicine

safety–performance trade-off

reasoning-oriented

cot faithfulness

large language models

poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held at Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.




Reconstructing dense geometry for dynamic scenes from a monocular video is a critical yet challenging task.
Recent memory-based methods enable efficient online reconstruction, but they fundamentally suffer from a Memory Demand Dilemma:
The memory representation faces an inherent conflict between the long-term stability required for static structures and the rapid, high-fidelity detail retention needed for dynamic motion.
This conflict forces existing methods into a compromise, leading to either geometric drift in static structures or blurred, inaccurate reconstructions of dynamic objects.
To address this dilemma, we propose Mem4D, a novel framework that decouples the modeling of static geometry and dynamic motion. Guided by this insight, we design a dual-memory architecture: 
1) The Transient Dynamics Memory (TDM) focuses on capturing high-frequency motion details from recent frames, enabling accurate and fine-grained modeling of dynamic content;
2) The Persistent Structure Memory (PSM) compresses and preserves long-term spatial information, ensuring global consistency and drift-free reconstruction for static elements.
By alternating queries to these specialized memories, Mem4D simultaneously maintains static geometry with global consistency and reconstructs dynamic elements with high fidelity.
Experiments on challenging benchmarks demonstrate that our method achieves state-of-the-art or competitive performance while maintaining high efficiency. Code will be publicly available at https://github.com/Mem4D/Mem4D.

Mem4D: Decoupling Static and Dynamic Memory for Dynamic Scene Reconstruction

Continual learning for action recognition is a critical capability for next-generation Extended Reality (XR) systems. Yet it faces a severe real-world challenge: strict user privacy that prohibits data rehearsal. While recent prompt-based continual learning methods show promise, we argue their flat, single-granularity design is structurally mismatched to the complexity of human actions. This monolithic architecture fails to model the inherent hierarchical structure of individual actions and overlooks standard action primitives shared across tasks, resulting in suboptimal performance and hindered knowledge transfer. To overcome this limitation, we propose DPCA, a novel spatio-temporal continual learning framework with multi-granularity adaptive prompting. DPCA learns three synergistic components to resolve this mismatch. First, the task-specific prompter employs a multi-granularity query system to capture the unique, compositional semantics of each action. Second, the task-agnostic prompter learns a globally shared vocabulary of "action primitives," providing a stable and generalizable knowledge base to mitigate catastrophic forgetting. Third, we introduce a Dissimilarity Attention Rectification at each granularity level, which leverages a reverse attention mechanism to model class-agnostic background information, effectively alleviating overfitting. The synergy between these components enables robust model adaptation without requiring access to past data. Rigorous experiments on the NTU RGB+D benchmark, under a strict rehearsal-free, few-shot protocol, confirm that DPCA establishes a new state-of-the-art, advancing the realization of intelligent and privacy-respecting XR systems.

Decomposing Prompts, Composing Actions: A Multi-Granularity Prompting Approach for Incremental Action Learning

Recent advances in multimodal large language models (MLLMs) have significantly improved medical AI, enabling it to unify the understanding of visual and textual information. However, as medical knowledge continues to evolve, it is critical to enable these models to efficiently update outdated or incorrect information without retraining from scratch. Although textual knowledge editing has been widely studied, there is still a lack of systematic benchmarks for multimodal medical knowledge editing involving image and text modalities. To fill this gap, we present MedMKEB, the first comprehensive benchmark designed to evaluate the reliability, generality, locality, portability, and robustness of knowledge editing in medical multimodal large language models. MedMKEB is built on a high-quality medical visual question-answering dataset and enriched with carefully constructed editing tasks including counterfactual correction, semantic generalization, knowledge transfer, and adversarial robustness. We incorporate human expert validation to ensure the accuracy and reliability of the benchmark. Extensive experiments on state-of-the-art general and medical MLLMs demonstrate the limitations of existing knowledge editing methods in the medical domain, highlighting the need to develop specialized editing strategies. MedMKEB will serve as a standard benchmark to promote the development of trustworthy and efficient medical knowledge editing algorithms.

MedMKEB: A Comprehensive Knowledge Editing Benchmark for Medical Multimodal Large Language Models

Ultra-low altitude UAVs (below 120 meters) are gaining importance in the booming low-altitude economy, where GNSS signals are often unreliable or unavailable. Vision-based localization emerges as a promising alternative; however, existing benchmarks are not designed for ultra-low flight and typically adopt pinhole cameras with limited field of view, making them less effective in handling occlusions and repetitive textures near the ground. To address these limitations, we introduce the first panoramic UAV localization dataset tailored for ultra-low altitude scenarios. Built on a four-fisheye-camera system in the high-fidelity RflySim platform, our dataset captures diverse conditions (including day/night cycles, extreme weather, and dynamic obstacles) and contains hundreds of thousands of frames. It is further enhanced with real-world UAV panoramic data to narrow the sim-to-real gap and will be continuously updated for broader applicability. Comprehensive experiments confirm the effectiveness and transferability of our dataset, establishing it as a robust benchmark for future research in vision-based UAV localization.

RflyPano: A Panoramic Benchmark for Ultra-low Altitude UAV Localization Powered by RflySim

Fine-tuning techniques for large language models (LLMs) not only improve adaptability to diverse downstream tasks, but also mitigate the adverse effects of model quantization.
Despite this, conventional quantization suffers from a structural limitation that hinders flexibility during the fine-tuning and deployment stages.
Practical on-device tasks demand different quantization precisions (i.e., different bit-widths); for example, understanding tasks tend to tolerate reduced precision better than generation tasks.
Conventional quantization, typically relying on scaling factors that are incompatible across bit-widths, fails to support on-device switching of precisions when confronted with complex real-world scenarios.
To overcome this dilemma, we propose OTARo, a novel method that enables on-device LLMs to flexibly switch quantization precisions while maintaining performance robustness with a single round of fine-tuning.
OTARo introduces Shared Exponent Floating Point (SEFP), a distinct quantization mechanism, to produce different bit-widths through simple mantissa truncations of a single model.
Moreover, to achieve bit-width robustness in downstream applications, OTARo performs a learning process toward losses induced by different bit-widths.
The method involves two critical strategies: (1) Exploitation-Exploration Bit-Width Path Search (BPS), which iteratively updates the search path via a designed scoring mechanism; (2) Low-Precision Asynchronous Accumulation (LAA), which performs asynchronous gradient accumulations and delayed updates under low bit-widths. 
Experiments on popular LLMs, e.g., LLaMA3.2-1B, LLaMA3-8B, demonstrate that OTARo achieves consistently strong and robust performance for all precisions.

OTARo: Once Tuning for All Precisions Toward Robust On-Device LLMs
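As a rough illustration of the shared-exponent idea described in the OTARo abstract, the sketch below quantizes a block of values under one shared exponent and then derives a lower bit-width by dropping low-order mantissa bits. The abstract does not specify SEFP's actual format, so the block layout, the rounding rule, the sign-magnitude truncation, and all function names (`sefp_encode`, `truncate_mantissas`, `sefp_decode`) are assumptions made purely for illustration.

```python
import math

def sefp_encode(block, mantissa_bits):
    """Encode floats as integer mantissas under one shared exponent (hypothetical layout)."""
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return 0, [0] * len(block)
    # frexp gives max_abs = m * 2**e with 0.5 <= m < 1, so every |x| < 2**e.
    shared_exp = math.frexp(max_abs)[1]
    scale = 2.0 ** (mantissa_bits - shared_exp)
    limit = 2 ** mantissa_bits - 1  # largest magnitude that fits in the mantissa
    mantissas = [max(-limit, min(limit, round(x * scale))) for x in block]
    return shared_exp, mantissas

def truncate_mantissas(mantissas, from_bits, to_bits):
    """Drop low-order mantissa bits (sign-magnitude) to reach a lower bit-width."""
    shift = from_bits - to_bits
    return [(abs(m) >> shift) * (1 if m >= 0 else -1) for m in mantissas]

def sefp_decode(shared_exp, mantissas, mantissa_bits):
    """Map integer mantissas back to floats using the shared exponent."""
    step = 2.0 ** (shared_exp - mantissa_bits)
    return [m * step for m in mantissas]

block = [0.75, -0.5, 0.1, 0.0]
exp8, m8 = sefp_encode(block, mantissa_bits=8)
m4 = truncate_mantissas(m8, from_bits=8, to_bits=4)

print(sefp_decode(exp8, m8, 8))  # higher-precision view
print(sefp_decode(exp8, m4, 4))  # lower-precision view from the same stored mantissas
```

The point of the sketch is that both precisions are read from the same stored mantissas and the same exponent, so no per-bit-width scaling factor has to be recomputed when switching.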

Cloth-changing person re-identification (CC-ReID) aims to identify individuals across non-overlapping cameras despite clothing variations. Existing methods are often constrained by two primary limitations: approaches using auxiliary modalities typically rely on a single specific cue, limiting their robustness, while feature disentanglement methods struggle with discrete labels that create inconsistencies between ground truth labels and modality semantic similarity. To overcome these limitations, we propose **DRDnet**, a unified framework that synergistically integrates dual auxiliary cues and advanced relation modeling. Specifically, our Dual-Stream Disentanglement (DSD) module leverages textual descriptions and parsing images to decouple clothing factors through high-level semantic supervision and pixel-level operations, yielding robust clothing-agnostic features. Simultaneously, our Modal Relation Modeling (MRM) module constructs feature memory banks and employs adaptive soft label smoothing, effectively enhancing image-text semantic alignment and reinforcing identity consistency across clothing changes. We evaluate DRDnet on several CC-ReID benchmarks, where it achieves state-of-the-art performance across all of them.

Dual-stream Relation-modeling Disentanglement for Cloth-Changing Person Re-Identification

Vision-Language-Action (VLA) models revolutionize robotic systems by enabling end-to-end perception-to-action pipelines that integrate multiple sensory modalities, such as visual signals captured by cameras and auditory signals captured by microphones. This multi-modality integration allows VLA models to interpret complex, real-world environments using diverse sensor data streams. Although VLA-based systems rely heavily on sensory input, the security of VLA models against physical-world sensor attacks remains critically underexplored.

To address this gap, we present the first systematic study of physical sensor attacks against VLAs, quantifying the influence of sensor attacks and investigating the defenses for VLA models. We introduce a novel "Real-Sim-Real" framework that automatically simulates physics-based sensor attack vectors, including six attacks targeting cameras and two targeting microphones, and validates them on real robotic systems. Through large-scale evaluations across various VLA architectures and tasks under varying attack parameters, we demonstrate significant vulnerabilities, with susceptibility patterns that reveal critical dependencies on task types and model designs. We further develop an adversarial-training-based defense that enhances VLA robustness against out-of-distribution physical perturbations caused by sensor attacks while preserving model performance. Our findings expose an urgent need for standardized robustness benchmarks and mitigation strategies to secure VLA deployments in safety-critical environments.

Phantom Menace: Exploring and Enhancing the Robustness of VLA Models Against Physical Sensor Attacks

Software vulnerabilities have increased sharply, underscoring the growing urgency for effective detection methods. Although large language model (LLM) based methods have shown promise in this task, current state-of-the-art LLM approaches struggle with functions that have long contexts. In this paper, we propose CTX-Coder, a context-enhanced vulnerability detection framework that enables LLMs to selectively focus on relevant contextual functions. To achieve this, we represent the contextual functions as embeddings and integrate them with the target code via cross-attention, thereby enhancing the model's ability to capture contextual information. Furthermore, to equip the model with the ability to recognize these embedding features, we propose a two-stage pretraining pipeline. We also introduce a new dataset, CTX-VUL, which addresses the limitations of existing datasets that either lack contextual information for vulnerable functions or are not publicly available. Extensive experiments demonstrate that CTX-Coder (10B) significantly outperforms baseline models with more parameters, such as Qwen2.5-14B and SecGPT. As the input code length increases, CTX-Coder’s F1 score drops by only 5.01%, while other models degrade by 25% to 41.5%, showing strong robustness to long-context scenarios and the effectiveness of our design.

CTX-Coder: Cross-Attention Architectures Empower LLMs for Long-Context Vulnerability Detection
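To make the cross-attention step mentioned in the CTX-Coder abstract more concrete, here is a minimal sketch of generic scaled dot-product cross-attention, in which target-code token vectors act as queries over a set of context-function embeddings. It is not the authors' architecture: there are no learned projections, the toy vectors are invented, and the names `cross_attention`, `target_tokens`, and `context_embeddings` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention without learned projections."""
    dim = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(dim).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of value vectors, one output component at a time.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, values))
                        for c in range(len(values[0]))])
    return outputs, weights

# Toy data (assumed): two target-code token vectors, three context-function embeddings.
target_tokens = [[1.0, 0.0], [0.0, 1.0]]
context_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

attended, weights = cross_attention(target_tokens, context_embeddings, context_embeddings)
print(weights[0])  # attention of the first target token over the three context functions
```

Each row of `weights` sums to 1, so every target token receives a convex combination of the context embeddings, weighted toward the context functions it is most similar to.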

Video generation using Large Language Models (LLMs) has shown promising potential, effectively leveraging the extensive LLM infrastructure to provide a unified framework for multimodal understanding and content generation. However, these methods face critical challenges, i.e., token redundancy and inefficiencies arising from long sequences, which constrain their performance and efficiency compared to diffusion-based approaches. In this study, we investigate the impact of token redundancy in LLM-based video generation through information-theoretic analysis and propose Vision Representation Compression (VRC), a novel framework designed to achieve better performance and efficiency with fewer video token representations. VRC introduces a learnable representation compressor and decompressor to compress video token representations, enabling autoregressive next-sequence prediction in a compact latent space. Our approach reduces redundancy, shortens token sequences, and improves the model's ability to capture underlying video structures. Our experiments demonstrate that VRC reduces token sequence lengths by a factor of 4, achieving a 9x to 14x acceleration in inference while maintaining performance comparable to state-of-the-art video generation models. VRC not only accelerates inference but also significantly reduces memory requirements during both model training and inference.

Less Is More: Vision Representation Compression for Efficient Video Generation with Large Language Models

Foundational vision-language models (VLMs), such as CLIP, are emerging as a promising paradigm in vision tasks due to their strong generalization ability. Nevertheless, adapting them to downstream tasks remains challenging, especially in biomedical imaging, where scarce annotations, low-contrast features and complex patterns hinder model adaptation. Thus, prompt tuning is employed to facilitate the adaptation of VLMs. However, current prompt tuning methods like Context Optimization (CoOp) mainly learn a single, static prompt that is applied to all images, and such a one-size-fits-all prompt cannot describe the case-specific diagnostic cues in biomedical data, compromising the adaptation of VLMs.
To this end, we propose a Dynamic Prompt Policy learning method that enables efficient adaptation of Biomedical VLMs (BioDPP) for accurate and highly generalizable few-shot biomedical image classification. Specifically, we conceptualize the learnable context as an agent, and present a paradigm of learning a dynamic prompting policy, rather than obtaining a single, static prompt. Within this paradigm, a dual-reward mechanism is developed to guide policy learning via feedback on both the classification decision and the consistency between the prompt and the context, steering the agent to generate context-aware prompts. Moreover, we devise adaptive baseline stabilization to dynamically regulate the reward advantage value throughout the training process, enabling policy refinement in a complex reward space tailored to biomedical VLMs. Extensive experiments are conducted on 10 biomedical datasets, and the results reveal that our BioDPP achieves superior performance, demonstrating more efficient prompt optimization in biomedical VLMs.
