Singapore

Large Language Models (LLMs) have shown remarkable success on a wide range of math and reasoning benchmarks. However, we observe that they often struggle when faced with unreasonable math problems. Instead of recognizing these issues, models frequently proceed as if the problem is well-posed, producing incorrect answers or falling into overthinking and verbose self-correction. To systematically investigate this overlooked vulnerability, we propose the Unreasonable Math Problems (UMP) benchmark, designed to evaluate LLMs' ability to detect and respond to unreasonable math problem statements. Based on extensive experiments covering 19 LLMs, we find that even state-of-the-art general models like GPT-4o struggle on UMP. While reasoning models such as DeepSeek-R1 demonstrate a higher sensitivity to unreasonable inputs, this often comes at the cost of generating overly long and meaningless responses that fail to converge. We further find that prompting and fine-tuning enhance the detection of unreasonable inputs, with minor and acceptable trade-offs, making them practical solutions in this challenging setting.

AAAI 2026

Large Language Models Struggle with Unreasonability in Math Problems

nlp: safety and robustness; nlp: applications; nlp: interpretability, analysis, and evaluation of nlp models

poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held at Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.

Multi-armed bandit algorithms are fundamental tools for sequential decision-making under uncertainty, with widespread applications across domains such as clinical trials and personalized decision-making. As bandit algorithms are increasingly deployed in these socially sensitive settings, it becomes critical to protect user data privacy and ensure fair treatment across decision rounds. While prior work has independently addressed privacy and fairness in bandit settings, the question of whether both objectives can be achieved simultaneously has remained largely open. Existing privacy-preserving bandit algorithms typically optimize average regret, a utilitarian measure, whereas fairness-aware approaches focus on minimizing Nash regret, which penalizes inequitable reward distributions, but often disregard privacy concerns.

To bridge this gap, we introduce Differentially Private Nash Confidence Bound (DP-NCB)—a novel and unified algorithmic framework that simultaneously ensures $\epsilon$-differential privacy and achieves order-optimal Nash regret, matching known lower bounds up to logarithmic factors. The framework is sufficiently general to operate under both global and local differential privacy models, and is anytime, requiring no prior knowledge of the time horizon. We support our theoretical guarantees with simulations on synthetic bandit instances, showing that DP-NCB incurs substantially lower Nash regret than state-of-the-art baselines. Our results offer a principled foundation for designing bandit algorithms that are both privacy-preserving and fair, making them suitable for high-stakes, socially impactful applications.
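To make the contrast between the two regret notions concrete, here is a minimal sketch, assuming the standard definitions: average regret compares the best arm's mean with the arithmetic mean of the expected rewards of the pulled arms, while Nash regret uses their geometric mean. The function names and the toy reward sequence are illustrative assumptions and are not taken from the paper; the sketch does not implement DP-NCB itself.

```python
import math

def average_regret(best_mean, pulled_means):
    """Best arm's mean minus the arithmetic mean of the pulled arms' expected rewards."""
    return best_mean - sum(pulled_means) / len(pulled_means)

def nash_regret(best_mean, pulled_means):
    """Best arm's mean minus the geometric mean of the pulled arms' expected rewards.

    Assumes expected rewards are non-negative (e.g. in [0, 1]).
    """
    if any(m < 0 for m in pulled_means):
        raise ValueError("expected rewards must be non-negative")
    if any(m == 0 for m in pulled_means):
        # A single round with zero expected reward drives the geometric mean to 0.
        return best_mean
    log_geo_mean = sum(math.log(m) for m in pulled_means) / len(pulled_means)
    return best_mean - math.exp(log_geo_mean)

# Hypothetical history: nine rounds on a good arm, one round on a very poor arm.
pulled = [0.9] * 9 + [0.01]
print(average_regret(0.9, pulled))
print(nash_regret(0.9, pulled))
```

Because the geometric mean never exceeds the arithmetic mean for non-negative values, Nash regret is never smaller than average regret on the same history; a single very poor round is penalized much more heavily, which is the fairness intuition described above.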

DP-NCB: Privacy Preserving Fair Bandits

We introduce SampurNER, a fine-grained named entity recognition (FgNER) dataset encompassing all 22 scheduled Indian languages spoken by more than two billion people across various countries. While manual annotation for FgNER resources is often labor-intensive and expensive, distant supervision methods have been employed as a viable solution. However, such datasets are often noisy, with entity mentions tagged with multiple types, requiring computationally intensive noise-aware models for effective FgNER. Moreover, resources for both coarse-grained and fine-grained named entity recognition tasks in Indian languages remain scarce. To address this, we propose an entity-anchored machine translation framework that leverages the largest manually annotated English FgNER dataset, FewNERD, to create a large-scale FgNER dataset in 22 languages. On average, the dataset comprises over 153k sentences, 354k entities, and 3.3M tokens in each language. The languages covered are: Assamese (as), Bengali (bn), Bodo (brx), Dogri (doi), Gujarati (gu), Hindi (hi), Kannada (kn), Kashmiri (ks), Konkani (gom), Maithili (mai), Malayalam (ml), Manipuri (mni), Marathi (mr), Nepali (ne), Odia (or), Punjabi (pa), Sanskrit (sa), Santali (sat), Sindhi (sd), Tamil (ta), Telugu (te), and Urdu (ur). Various rigorous analyses and human evaluations confirm the high quality of the dataset and demonstrate the effectiveness of the entity-anchored machine translation framework, with up to a 9% increase in F1-score over the current state-of-the-art. Additionally, we extend our analysis to zero-shot, multilingual, and cross-lingual settings, investigating the influence of language family and script similarity on cross-lingual FgNER performance.

SampurNER: Fine-Grained Named Entity Recognition Dataset for 22 Indian Languages

Integrating LiDAR and camera information in the bird’s eye view (BEV) representation has demonstrated its effectiveness in 3D object detection. However, owing to fundamental disparity in geometric and localization accuracy between these sensors, indiscriminate fusion in previous methods often leads to degraded performance. In this paper, we propose BEVDilation, a novel LiDAR-centric framework that prioritizes LiDAR information in the fusion. By formulating image BEV features as implicit guidance rather than naive concatenation, our strategy effectively alleviates the spatial misalignment caused by image depth estimation errors. Furthermore, the effective image guidance allows the LiDAR-centric paradigm to address the sparsity and semantic limitations of point clouds. Specifically, we propose a Sparse Voxel Dilation Block that mitigates the inherent point sparsity by densifying foreground voxel features through image priors. Moreover, we introduce a Semantic-Guided BEV Dilation Block to enhance the LiDAR feature diffusion processing with image semantic guidance and long-range context capture. On the challenging nuScenes benchmark, BEVDilation achieves better performance than state-of-the-art methods while maintaining competitive computational efficiency. Importantly, our LiDAR-centric strategy demonstrates greater robustness to depth noise compared to naive fusion. Our code will be released.

BEVDilation: LiDAR-Centric Multi-Modal Fusion for 3D Object Detection

Three-dimensional scene reconstruction from sparse-view satellite images is a long-standing and challenging task. While 3D Gaussian Splatting (3DGS) and its variants have recently attracted attention for their high efficiency, existing methods remain unsuitable for satellite images due to incompatibility with rational polynomial coefficient (RPC) models and limited generalization capability. Recent advances in generalizable 3DGS approaches show potential, but they perform poorly on multi-temporal sparse satellite images due to limited geometric constraints, transient objects, and radiometric inconsistencies. To address these limitations, we propose SkySplat, a novel self-supervised framework that integrates the RPC model into the generalizable 3DGS pipeline, enabling more effective use of sparse geometric cues for improved reconstruction. SkySplat relies only on RGB images and radiometric-robust relative height supervision, thereby eliminating the need for ground-truth height maps. Key components include a Cross-Self Consistency Module (CSCM), which mitigates transient object interference via consistency-based masking, and a multi-view consistency aggregation strategy that refines reconstruction results. Compared to per-scene optimization methods, SkySplat achieves an 86× speedup over EOGS with higher accuracy. It also outperforms generalizable 3DGS baselines, significantly reducing MAE from 13.18 m to 1.80 m on the DFC19 dataset, and demonstrates strong cross-dataset generalization on the MVS3D benchmark.

SkySplat: Generalizable 3D Gaussian Splatting from Multi-Temporal Sparse Satellite Images

Graph Neural Networks (GNNs) have emerged as powerful tools for learning over graph-structured data, yet recent studies have shown that their performance gains are beginning to plateau. In many cases, well-established models such as GCN and GAT, when appropriately tuned, can match or even exceed the performance of more complex, state-of-the-art architectures. This trend highlights a key limitation in the current landscape: the difficulty of selecting the most suitable model for a given graph task or dataset. To address this, we propose Self-Adaptive Graph Mixture of Models (SAGMM), a modular and practical framework that learns to automatically select and combine the most appropriate GNN models from a diverse pool of architectures. Unlike prior mixture-of-experts approaches that rely on variations of a single base model, SAGMM leverages architectural diversity and a topology-aware attention gating mechanism to adaptively assign experts to each node based on the structure of the input graph. To improve efficiency, SAGMM includes a pruning mechanism that reduces the number of active experts during training and inference without compromising performance. We also explore a training-efficient variant in which expert models are pretrained and frozen, and only the gating and task-specific layers are trained. We evaluate SAGMM on 16 benchmark datasets covering node classification, graph classification, regression, and link prediction tasks, and demonstrate that it consistently outperforms or matches leading GNN baselines and prior mixture-based methods, offering a robust and adaptive solution for real-world graph learning.
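The idea of gating over a pool of experts and pruning the inactive ones can be illustrated with a generic per-node sketch. This is not SAGMM's topology-aware attention gate, whose details are not given here; the gate scores are assumed to be supplied by some upstream module, and `mix_experts` and its arguments are hypothetical names chosen for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mix_experts(gate_scores, expert_outputs, top_k):
    """Combine expert output vectors for one node, keeping only the top_k experts.

    gate_scores:    one score per expert (assumed to come from a gating module).
    expert_outputs: one output vector (list of floats) per expert.
    Returns the mixed vector and the indices of the experts that stayed active.
    """
    ranked = sorted(range(len(gate_scores)), key=lambda i: gate_scores[i], reverse=True)
    active = ranked[:top_k]  # pruning step: only these experts contribute
    weights = softmax([gate_scores[i] for i in active])
    dim = len(expert_outputs[0])
    mixed = [0.0] * dim
    for weight, idx in zip(weights, active):
        for d in range(dim):
            mixed[d] += weight * expert_outputs[idx][d]
    return mixed, active

# Hypothetical node with three experts (e.g. GCN, GAT, and a third model).
mixed, active = mix_experts([2.0, 0.0, 2.0], [[1.0, 0.0], [0.0, 1.0], [3.0, 0.0]], top_k=2)
print(mixed, active)
```

Renormalizing the weights over the surviving experts keeps the output a convex combination, so pruning changes which experts are evaluated without changing the scale of the mixed representation.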

Self-Adaptive Graph Mixture of Models

Cross-modal retrieval aims to align different modalities via semantic similarity. However, existing methods often assume that image-text pairs are perfectly aligned, overlooking Noisy Correspondences in real data. These misaligned pairs misguide similarity learning and degrade retrieval performance. Previous methods often rely on coarse-grained categorizations that simply divide data into clean and noisy samples, overlooking the intrinsic diversity within noisy instances. Moreover, they typically apply uniform training strategies regardless of sample characteristics, resulting in suboptimal sample utilization for model optimization. To address the above challenges, we introduce a novel framework, called Pseudo-label Consistency-Guided Sample Refinement (PCSR), which enhances correspondence reliability by explicitly dividing samples based on pseudo-label consistency. Specifically, we first employ a confidence-based estimation to distinguish clean and noisy pairs, then refine the noisy pairs via pseudo-label consistency to uncover structurally distinct subsets. We further propose a Pseudo-label Consistency Score (PCS) to quantify prediction stability, enabling the separation of ambiguous and refinable samples within noisy pairs. Accordingly, we adopt Adaptive Pair Optimization (APO), where ambiguous samples are optimized with robust loss functions and refinable ones are enhanced via text replacement during training. Extensive experiments on CC152K, MS-COCO and Flickr30K validate the effectiveness of our method in improving retrieval robustness under noisy supervision. Our code is available in the supplementary materials.

PCSR: Pseudo-label Consistency-Guided Sample Refinement for Noisy Correspondence Learning

Label Distribution Learning (LDL) is an effective machine learning paradigm for addressing label ambiguity, where each sample is annotated with a distribution that conveys rich semantic information. However, during the actual annotation process of label distributions, annotators often exhibit divergent labeling preferences for the same sample. Most existing LDL methods overlook this heterogeneity, assuming that the observed label distribution originates from a single labeling pattern. Such an assumption limits their capacity to manage inter-annotator disagreement and constrains the generalization of the resulting models. To address this issue, we propose, for the first time, a Dirichlet process mixture model (DPMM)-based framework for LDL. This framework leverages nonparametric Bayesian methods to adaptively uncover diverse latent labeling patterns from the data and to accurately model annotator heterogeneity. Specifically, the ground-truth label distribution of each sample is modeled as a weighted mixture of multiple latent components, where a feature-conditioned gating mechanism adaptively controls the contribution of each component. Experimental results demonstrate that the proposed model consistently achieves competitive performance on several widely-used benchmark datasets.
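One way to read the mixture described above, as a sketch under assumed notation rather than the paper's own formulation: let $d_k$ be the label distribution of the $k$-th latent labeling pattern and $\pi_k(x)$ the gating weight assigned to it for a sample with features $x$. With a truncation level $K$ (an assumption made here for illustration), the modeled label distribution is

$$\hat{d}(x) = \sum_{k=1}^{K} \pi_k(x)\, d_k, \qquad \pi_k(x) \ge 0, \qquad \sum_{k=1}^{K} \pi_k(x) = 1.$$

Since each $d_k$ sums to one over the labels and the weights form a convex combination, $\hat{d}(x)$ is itself a valid label distribution.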

Learning Label Distribution with Dirichlet Process Mixture Model

With the rapid development of generative models, such as generative adversarial networks and diffusion models, the task of face forgery detection has emerged, aiming to identify forged faces in real-world scenarios. A key challenge for current face forgery detection models is improving generalization to unknown forgeries. To address this, we propose ResProto-FD, a framework that constructs residual prototype sets to capture diverse forgery cues and discriminative differences from real faces. Our novel perspective collects prototypes from the most informative residual features generated during training, enabling better representation of various forgery traces and real-vs-fake distinctions. First, we introduce a Visual-Language Residual Learning (VLRL) module based on the CLIP model. This module constructs residual features between image and text embeddings to capture inconsistencies between visual features and associated textual semantics. In doing so, it guides the model to attend to subtle visual forgery clues and enhances the discriminative power of image representations. Furthermore, we design a Gradient-aware Residual Prototypes (GRP) mechanism, a dynamic collection strategy that selectively stores uncertain residual features based on gradient signals to build the prototype sets. This enhances the model’s ability to generalize to unknown forgery types. Extensive experiments across various datasets and forgery methods demonstrate that ResProto-FD significantly improves generalization performance and consistently outperforms state-of-the-art methods.

ResProto-FD: Visual-Language Residual Prototype Sets for Generalized Face Forgery Detection

Recent advances in point cloud analysis have increasingly leveraged large-scale unlabeled data through self-supervised representation learning. Autoregressive models based on next-token prediction have shown strong performance, but they usually model point clouds as linear sequences, ignoring their inherent spatial structure. To address this limitation, we propose PointChain, a novel autoregressive paradigm inspired by human perception mechanisms, designed to better align with the structural properties of point clouds. Specifically, we introduce structural chain encoding, which models the understanding process as a global-to-local structural chain inference, preserving spatial relationships throughout the prediction sequence. During pre-training, we design two auxiliary tasks: a next-scale prediction task that encourages cross-scale reasoning, and a scale-level contrastive learning task that promotes semantic consistency across scales. These components guide the model to learn more discriminative and generalizable point cloud representations. Experiments on multiple benchmarks, using both Transformer and Mamba backbones, validate the effectiveness of our approach. PointChain achieves state-of-the-art performance on several downstream tasks, including 93.75% accuracy on the hardest split of ScanObjectNN without a voting strategy.

PointChain: Learning Generalizable Point Cloud Representations via Structural Chain Modeling

As a challenging vision-language task, Zero-Shot Composed Image Retrieval (ZS-CIR) is designed to retrieve target images using bi-modal (image+text) queries. Typical ZS-CIR methods employ an inversion network to generate pseudo-word tokens that effectively represent the input semantics. However, inversion-based methods suffer from two inherent issues. First, a task discrepancy exists because inversion training and CIR inference involve different objectives. Second, a modality discrepancy arises from the input feature distribution mismatch between training and inference. To address these issues, we propose a lightweight post-hoc framework consisting of two components: (1) a new text-anchored triplet construction pipeline that leverages a large language model (LLM) to transform a standard image-text dataset into a triplet dataset, where a textual description serves as the target of each triplet; and (2) the MoTa-Adapter, a novel parameter-efficient fine-tuning method that adapts the dual encoder to the CIR task using our constructed triplet data. Specifically, on the text side, multiple sets of learnable task prompts are integrated via a Mixture-of-Experts (MoE) layer to capture task-specific priors and handle different types of modifications. On the image side, MoTa-Adapter modulates the inversion network's input to better match the downstream text encoder. In addition, an entropy-based optimization strategy is proposed to assign greater weight to challenging samples, thus ensuring efficient adaptation. Experiments show that, with the incorporation of our proposed components, inversion-based methods achieve significant improvements, reaching state-of-the-art performance across four widely-used benchmarks. All data and code will be made publicly available.
