Benchmarks serve as standardized test systems for distinguishing the capabilities of large language models (LLMs). On discriminative items, high-ability LLMs favor the correct answers, while low-ability models assign those answers lower plausibility and tend toward incorrect ones. Current methods for assessing benchmark quality focus primarily on coverage of difficulty levels and task diversity, yet they do not directly quantify discrimination, the core metric. Furthermore, large-scale benchmarks incur high evaluation costs. Although heuristic methods can reduce item counts to some extent, they cannot guarantee that the benchmark's original discriminative properties are preserved. To address these limitations, we propose MetaEval, a meta-evaluation framework designed to precisely quantify per-item discrimination and enable efficient assessment. Central to MetaEval is our novel Signal Detection and Item Response (SD-IR) model, which simulates LLMs' detection of correct answers (signals) by representing each model's perception through two latent ability states: "known" and "unknown". For any item, discrimination is quantified as the difference in signal plausibility between these two states. Leveraging these discrimination metrics, MetaEval introduces two strategies that replicate full-benchmark results from minimal subsets for efficient evaluation: (1) distilling metaBench, a compact subset that retains discriminative power by removing redundant items; and (2) predicting full-benchmark performance from metaBench's discrimination. Experiments across five benchmarks confirm that high-discrimination items capture greater performance variation among LLMs, align more closely with full-benchmark rankings, and exhibit superior predictive ability. Notably, in the best case, MetaEval achieves accurate full-benchmark estimation using only 2.5% of items, substantially reducing evaluation costs while preserving reliability.
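The core idea, quantifying an item's discrimination as the gap in signal plausibility between the "known" and "unknown" latent states and then keeping only the most discriminative items, can be sketched as follows. This is a minimal illustration of the concept with simulated data, not the paper's SD-IR fitting procedure; all names and numbers are hypothetical.

```python
import random
from statistics import mean

random.seed(0)

# Hypothetical setup: for each (model, item) pair, a latent state says whether
# the model "knows" the item, and the model assigns a plausibility score to the
# correct answer (the "signal"). Known-state scores are drawn high, unknown low.
n_models, n_items = 6, 40
known = [[random.random() > 0.5 for _ in range(n_items)] for _ in range(n_models)]
plausibility = [
    [random.betavariate(8, 2) if known[i][j] else random.betavariate(2, 8)
     for j in range(n_items)]
    for i in range(n_models)
]

def item_discrimination(j):
    """Mean signal plausibility in the 'known' state minus the 'unknown' state."""
    k = [plausibility[i][j] for i in range(n_models) if known[i][j]]
    u = [plausibility[i][j] for i in range(n_models) if not known[i][j]]
    return (mean(k) if k else 0.0) - (mean(u) if u else 0.0)

disc = [item_discrimination(j) for j in range(n_items)]

# metaBench-style distillation: keep only the most discriminative items
# (here an illustrative top 10%).
keep = sorted(range(n_items), key=lambda j: disc[j], reverse=True)[: n_items // 10]
```

Under this toy model, items where the two states separate cleanly get high discrimination scores and survive the distillation step, mirroring how metaBench drops redundant, low-discrimination items.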
