Singapore

Multiple-choice question answering (MCQA) has emerged as one of the most popular task formats for evaluating large language models (LLMs). Unfortunately, there is substantial evidence that evaluation on current MCQA benchmarks suffers from significant answer bias, which severely undermines the reliability of the evaluation conclusions. Specifically, many LLMs achieve performance significantly higher than random selection even when the questions are omitted from the input. To this end, we conduct a systematic investigation of the attribution of answer bias and demonstrate a strong correlation between the degree of data contamination and the severity of answer bias, while the position of options and the popularity of answers have relatively minor effects. Building on these insights, we further propose OPD, a straightforward yet effective tool for contamination detection and dataset debiasing that does not require access to the model’s internal training data. Our findings and algorithms provide valuable insights for the design of future trustworthy LLM evaluation protocols.
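The question-omitted check described in the abstract can be illustrated with a minimal sketch. This is not OPD and not the authors' protocol. It only compares options-only accuracy against the random-selection baseline. The record layout (`options`, `answer`) and the `predict` callable, which stands in for a model that sees only the answer options, are assumptions made for illustration.

```python
from typing import Callable, Sequence


def options_only_gap(
    items: Sequence[dict],
    predict: Callable[[Sequence[str]], int],
) -> dict:
    """Compare options-only accuracy with the random-selection baseline.

    Each item is a hypothetical record: {"options": [...], "answer": int}.
    `predict` is a stand-in for a model call that receives only the options
    (the question text is deliberately withheld) and returns an option index.
    """
    if not items:
        raise ValueError("items must be non-empty")
    correct = 0
    chance = 0.0
    for item in items:
        options = item["options"]
        correct += int(predict(options) == item["answer"])
        # Expected accuracy of uniform random guessing on this item.
        chance += 1.0 / len(options)
    n = len(items)
    accuracy = correct / n
    baseline = chance / n
    # A large positive gap means the options alone leak the answer.
    return {"accuracy": accuracy, "chance": baseline, "gap": accuracy - baseline}
```

A gap near zero is what an unbiased benchmark would be expected to show under this check. A clearly positive gap indicates that the answer can be guessed without the question, which is the symptom the abstract calls answer bias.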

AAAI 2026

Does Question Really Matter? The Attribution of Answer Bias in LLM Evaluation

answer bias

data contamination

language model

evaluation

poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held at Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.



While large language models (LLMs) have demonstrated remarkable performance on high-level semantic tasks, they often struggle with fine-grained, token-level understanding and structural reasoning, capabilities that are essential for applications requiring precision and control. We introduce TASE, a comprehensive benchmark designed to evaluate LLMs' ability to perceive and reason about token-level information across languages. TASE covers 10 tasks under two core categories, token awareness and structural understanding, spanning Chinese, English, and Korean, with a 35,927-instance evaluation set and a scalable synthetic data generation pipeline for training. Tasks include character counting, token alignment, syntactic structure parsing, and length constraint satisfaction. We evaluate over 30 leading commercial and open-source LLMs, including O3, Claude 4, Gemini 2.5 Pro, and DeepSeek-R1, and train a custom Qwen2.5-14B model using the GRPO training method. Results show that human performance significantly outpaces current LLMs, revealing persistent weaknesses in token-level reasoning. TASE sheds light on these limitations and provides a new diagnostic lens for future improvements in low-level language understanding and cross-lingual generalization. We will release our code and dataset.

TASE: Token Awareness and Structured Evaluation for Multilingual Language Models
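As a rough illustration of the character counting task named in the TASE abstract, the sketch below computes a reference count and scores a model's reply by exact match. It is not the benchmark's scorer. The function names, the NFC normalization step, and the exact-match rule are all assumptions made for this example.

```python
import unicodedata


def count_char(text: str, target: str) -> int:
    """Count occurrences of a single character in text.

    Both strings are NFC-normalized first (an assumption of this sketch) so
    that precomposed and decomposed forms, e.g. Hangul syllables, compare equal.
    """
    text = unicodedata.normalize("NFC", text)
    target = unicodedata.normalize("NFC", target)
    if len(target) != 1:
        raise ValueError("target must be a single character after normalization")
    return text.count(target)


def score_counting(text: str, target: str, model_answer: str) -> bool:
    """Exact-match scoring: the reply must parse as the correct integer."""
    expected = count_char(text, target)
    try:
        predicted = int(model_answer.strip())
    except ValueError:
        # Replies that are not a bare integer are scored as wrong.
        return False
    return predicted == expected
```

The same two functions apply unchanged to English, Chinese, and Korean inputs, since Python strings are sequences of Unicode code points.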

We propose Knowledge Boundary Discovery (KBD), a reinforcement learning based framework to explore the knowledge boundaries of Large Language Models (LLMs). We define the knowledge boundary by automatically generating two types of questions: (i) those the LLM can confidently answer (within the knowledge boundary) and (ii) those it cannot (beyond the knowledge boundary). Iteratively exploring and exploiting the LLM's responses to find its knowledge boundaries is challenging because of the hallucination phenomenon. To find the knowledge boundaries of an LLM, the agent interacts with the LLM, which is modeled as a partially observable environment. The agent generates a progressive question as the action, uses entropy reduction as the reward, receives the LLM's response as the observation, and updates its belief states. We demonstrate that KBD detects the knowledge boundaries of LLMs by automatically finding a set of non-trivial answerable and unanswerable questions. We validate KBD by comparing its generated knowledge boundaries with manually crafted LLM benchmark datasets. Experiments show that our KBD-generated question set is comparable to the human-generated datasets. Our approach paves a new way to evaluate LLMs.

Knowledge Boundary Discovery for Large Language Models
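The KBD abstract says the agent uses entropy reduction as its reward. A minimal sketch of that idea follows, assuming the belief state is a discrete probability distribution and that entropy is Shannon entropy in bits. The abstract does not give the exact definition, so both choices, and the function names, are assumptions of this example.

```python
import math
from typing import Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in bits of a discrete distribution.

    The input is renormalized, so unnormalized non-negative weights are accepted.
    """
    total = sum(probs)
    if total <= 0 or any(p < 0 for p in probs):
        raise ValueError("probs must be non-negative with a positive sum")
    return -sum((p / total) * math.log2(p / total) for p in probs if p > 0)


def entropy_reduction_reward(prior: Sequence[float], posterior: Sequence[float]) -> float:
    """Reward = uncertainty before the LLM's response minus uncertainty after.

    Positive when the observation sharpened the belief, negative when it
    made the belief more diffuse.
    """
    return entropy(prior) - entropy(posterior)
```

For example, a belief that starts uniform over four hypotheses and narrows to two equally likely ones yields a reward of one bit under these assumptions.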

Pre-trained language models (PLMs) have shown strong potential in Ethereum account modeling and fraud detection. However, existing approaches often overlook the graph-structured nature of transaction networks. In addition, they struggle with the long-tail distribution of account activity, resulting in anisotropic embedding spaces and poor representation quality for low-frequency accounts. In this paper, we present IGT4ETH, a pre-trained Graph Transformer with isotropy-enhancing post-processing, which explicitly models transaction topologies and mitigates representational anisotropy for Ethereum account classification. IGT4ETH improves structural representation by incorporating structural centrality and role embeddings into an Edge-augmented Graph Transformer, effectively capturing both topological and interaction patterns in transaction graphs. To further mitigate embedding anisotropy, we systematically evaluate various post-processing techniques. Among them, we adopt the Conceptor Negation (CN) method to softly suppress latent features dominated by high-frequency words via matrix conceptors, alongside a modified Focal-InfoNCE loss to enhance directional uniformity and representation balance. Extensive experiments on four real-world Ethereum account classification tasks (phishing, exchange, mining, and ICO-wallet classification) demonstrate that IGT4ETH consistently outperforms state-of-the-art PLM-based baselines in classification performance.

IGT4ETH: An Isotropic Pre-trained Graph Transformer for Ethereum Account Classification

With the rise of 3D Gaussian Splatting (3DGS), a variety of digital watermarking techniques, embedding either 1D bitstreams or 2D images, are used for copyright protection. However, the robustness of these watermarking techniques against potential attacks remains underexplored. This paper introduces the first universal black-box attack framework, the Group-based Multi-objective Evolutionary Attack (GMEA), designed to challenge these watermarking systems. We formulate the attack as a large-scale multi-objective optimization problem, balancing watermark removal with visual quality. In a black-box setting, we introduce an indirect objective function that blinds the watermark detector by minimizing the standard deviation of features extracted by a convolutional network, thus rendering the feature maps uninformative. To manage the vast search space of 3DGS models, we employ a group-based optimization strategy to partition the model into multiple, independent sub-optimization problems. Experiments demonstrate that our framework effectively removes both 1D and 2D watermarks from mainstream 3DGS watermarking methods while maintaining high visual fidelity. This work reveals critical vulnerabilities in existing 3DGS copyright protection schemes and calls for the development of more robust watermarking systems.

Fading the Digital Ink: A Universal Black-Box Attack Framework for 3DGS Watermarking Systems

Multi-view clustering has been found useful for leveraging diverse data sources to obtain accurate and robust underlying data representations. It typically relies on effectively integrating the latent features from different views by allocating weights while simultaneously mining their specificity and consensus information. However, it remains open how to achieve a more fine-grained, sample-level weight allocation that promotes view-specific information fusion and view-shared consensus. To address this problem, we propose a novel multi-expert learning framework named Gated Variational Graph AutoEncoder with Competition and Consensus (GVGAE-$\text{C}^{2}$). In particular, it employs multiple view-specific Variational Graph AutoEncoders (VGAEs) as experts to capture the latent features from their own views. Furthermore, we design a fine-grained structure-aware gating network, which dynamically computes sample-level weights based on the proposed structure-aware quality evaluation of each expert, thus facilitating competition among experts. Meanwhile, each expert is trained not only to learn its assigned view's specific features but is also explicitly encouraged to learn consensus-aware features across views. Extensive multi-view clustering experiments on benchmark datasets reveal that GVGAE-$\text{C}^{2}$ significantly outperforms state-of-the-art methods.

Gated Variational Graph Autoencoders as Experts with Competition and Consensus for Multi-view Clustering

In offline-to-online (O2O) reinforcement learning, achieving efficient performance improvement while maintaining training stability remains a critical challenge for effective fine-tuning. Existing O2O methods usually focus on the balance between policy improvement and policy constraint during online fine-tuning. However, they often overlook sample differences, leading to suboptimal performance. To address this challenge, we identify that the effectiveness of policy learning exhibits significant variation across states. Therefore, we propose the notion of state proficiency to capture the degree of effective learning in a given state. We propose State Proficiency-Based Adaptive Fine-Tuning (SPA), a straightforward yet effective method that establishes proficiency-based sample priorities in policy optimization to facilitate effective fine-tuning. Specifically, SPA focuses on low proficiency samples during policy improvement to enhance sample efficiency, while emphasizing high proficiency samples during policy constraint to ensure stable training. Extensive empirical results demonstrate that SPA achieves significant improvements over existing methods, attaining state-of-the-art performance on the D4RL benchmark.

State Proficiency-Based Adaptive Fine-Tuning for Offline-to-Online Reinforcement Learning

Biologically plausible and energy-efficient frameworks such as Spiking Neural Networks (SNNs) have not been sufficiently explored in low-level vision tasks. Taking image deraining as an example, this study addresses the representation of the inherent high-pass characteristics of spiking neurons and proposes the Visual LIF (VLIF) neuron, which overcomes the lack of spatial contextual understanding in traditional spiking neurons. To tackle the frequency-domain saturation inherent in conventional spiking neurons, we leverage the proposed VLIF to introduce the Spiking Decomposition and Enhancement Module and the lightweight Spiking Multi-scale Unit for hierarchical multi-scale representation learning. Extensive experiments across five benchmark deraining datasets demonstrate that our approach significantly outperforms state-of-the-art SNN-based deraining methods while using only 13% of their energy consumption. These findings establish a solid foundation for deploying SNNs in high-performance, energy-efficient low-level vision tasks. The code is in the supplementary material and will be publicly released.

Exploring the Potentials of Spiking Neural Networks for Image Deraining

Real-time 3D reconstruction is crucial for robotics and augmented reality, yet current simultaneous localization and mapping (SLAM) approaches often struggle to maintain structural consistency and robust pose estimation in the presence of depth noise. This work introduces PointSLAM++, a novel RGB-D SLAM system that leverages a hierarchically constrained neural Gaussian representation to preserve structural relationships while generating Gaussian primitives for scene mapping. It also employs progressive pose optimization to mitigate depth sensor noise, significantly enhancing localization accuracy. Furthermore, it utilizes a dynamic neural representation graph that adjusts the distribution of Gaussian nodes based on local geometric complexity, enabling the map to adapt to intricate scene details in real time. This combination yields high-precision 3D mapping and photorealistic scene rendering. Experimental results show that PointSLAM++ outperforms existing 3DGS-based SLAM methods in reconstruction accuracy and rendering quality, demonstrating its advantages for large-scale AR and robotics.

PointSLAM++: Robust Dense Neural Gaussian Point Cloud-based SLAM

In the field of multi-spectral object re-identification (ReID), multi-modal knowledge and modal-specific knowledge exhibit complementary advantages when handling hard samples, but existing methods rarely integrate this collaborative information. Knowledge distillation is a direct approach for transferring information; however, heterogeneity in model architectures and variations in sample hardness can undermine the stability and controllability of knowledge transfer. To alleviate these limitations, we propose the novel Progressive Multi-modal Knowledge Distillation (PMKD) framework, which enables multi-stage knowledge transfer guided by hard sample awareness. In the multi-modal knowledge transfer stage, the source model (pre-trained on multi-modal data) disseminates its learned multi-modal collaborative knowledge to multiple independent modal-specific target models, guiding their adaptation to hard samples within training batches. In the modal-specific knowledge retention stage, the independent models enriched with multi-modal knowledge guide the training phase. The architectural consistency between source and target models ensures a less lossy knowledge transfer, effectively mitigating the risk of capability drift and preserving inherent competence. Moreover, the entire progressive multi-modal knowledge distillation is regulated by the proposed hardness-aware distillation loss, which automatically adapts distillation intensity through hard sample mining, thereby ensuring stable transfer of hard sample handling capabilities. Extensive experiments on benchmark multi-spectral ReID datasets validate the effectiveness and superior performance of the proposed method.

Progressive Multi-modal Knowledge Distillation for Multi-spectral Object Re-identification

Vision-Language Models (VLMs), with their powerful content generation capabilities, have been successfully applied to data annotation processes. However, VLM-generated labels exhibit dual limitations: low quality (i.e., label noise) and the absence of error correction mechanisms. To enhance label quality, we propose Human-Corrected Labels (HCLs), a novel setting that enables efficient human correction of VLM-generated noisy labels. As shown in Figure 1(b), HCL strategically deploys human correction only for instances with VLM discrepancies, achieving both higher-quality annotations and reduced labor costs. Specifically, we theoretically derive a risk-consistent estimator that incorporates both human-corrected labels and VLM predictions to train classifiers. We further propose a conditional probability method to estimate the label distribution using a combination of VLM outputs and model predictions. Extensive experiments demonstrate that our approach achieves superior classification performance and is robust to label noise, validating the effectiveness of HCL in practical weak supervision scenarios.

