AAAI 2026

January 25, 2026

Singapore, Singapore


Multimodal 3D object detection for autonomous driving must maintain robust performance under diverse perturbations and complex environmental conditions. However, most existing approaches optimize for relatively ideal scenarios or address only one or a few disturbances (interference or adverse conditions), lacking a systematic treatment of robustness against real-world factors such as severe class imbalance, adverse weather, sensor jitter and failure, and significant scene variation. To address this, we propose RobusTor3D, a robust multimodal 3D detector that builds in robustness at both the structural and supervisory levels by blending knowledge from Vision-Language Models (VLMs). Structurally, textual descriptions are incorporated to enrich the semantics and diversity of rare classes; this semantic injection compensates for the inherent class imbalance and the modality weaknesses of conventional visual features. In addition, the semantic alignment and robust representations obtained through Vision-Language Knowledge Extraction (V-LKE) serve as semantic priors that complement modality-specific representations, improving model adaptability. At the supervisory level, we propose a Scene-level Multimodal Consistency Learning (SMCL) strategy that jointly enforces global semantic constraints across modalities, encouraging stable and rich semantic representations. This design reduces sensitivity to spatial misalignment and enables semantic compensation when a modality is lost.
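To illustrate the semantic-injection idea, here is a minimal sketch of blending per-class text embeddings into visual features, with rare classes weighted more heavily. The function name, shapes, and the `alpha`/`rare_boost` weighting are illustrative assumptions, not the paper's actual operation.

```python
import numpy as np

# Assumed shapes: `visual_feats` is (num_boxes, d) visual features,
# `text_embeds` maps class name -> (d,) text embedding (e.g. from a frozen
# VLM text encoder), and `rare_classes` holds under-represented labels.
def inject_text_semantics(visual_feats, labels, text_embeds,
                          rare_classes, alpha=0.3, rare_boost=2.0):
    """Blend class-level text embeddings into per-box visual features,
    up-weighting the text contribution for rare classes to counter
    class imbalance (a hypothetical instantiation of semantic injection)."""
    out = visual_feats.copy()
    for i, label in enumerate(labels):
        text_vec = text_embeds[label]
        # Rare classes lean more on the text prior than common ones.
        w = alpha * (rare_boost if label in rare_classes else 1.0)
        out[i] = (1.0 - w) * visual_feats[i] + w * text_vec
    return out
```

With `alpha=0.3` a common-class feature keeps 70% of its visual component, while a rare class (weight 0.6) draws more heavily on the text embedding.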
Extensive robustness experiments on the KITTI, KITTI-C, and CADC benchmarks evaluate five robustness aspects: the long-tail problem, adverse weather (rain, snow, fog, and strong sunlight), sensor spatial misalignment and motion blur, modality loss, and cross-domain scenarios. RobusTor3D demonstrates superior robustness across all five aspects, consistently outperforming state-of-the-art methods under these challenging conditions.
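The scene-level consistency objective (SMCL) can be sketched as a loss that pulls globally pooled per-modality scene embeddings toward one another; because it operates on global features rather than per-point correspondences, it is insensitive to spatial misalignment. The pairwise cosine formulation below is an assumption for illustration, not the paper's exact loss.

```python
import numpy as np

def scene_consistency_loss(feats):
    """Average pairwise cosine dissimilarity between globally pooled
    scene embeddings, one per modality (e.g. LiDAR, camera, text).

    feats: list of 1-D arrays of equal dimension. Returns 0 when all
    modality embeddings point in the same direction.
    """
    normed = [f / (np.linalg.norm(f) + 1e-8) for f in feats]
    loss, pairs = 0.0, 0
    for i in range(len(normed)):
        for j in range(i + 1, len(normed)):
            loss += 1.0 - float(np.dot(normed[i], normed[j]))
            pairs += 1
    return loss / max(pairs, 1)
```

Because the constraint is pairwise over whichever modalities are present, dropping one modality at test time leaves the remaining embeddings already aligned to a shared semantic space, which is one way the modality-loss compensation described above could arise.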

