Singapore

The increasing use of large language models (LLMs) in sensitive domains has led to growing interest in how their confidence scores relate to fairness and bias. This study examines the alignment between LLM-predicted confidence and human-annotated bias judgments. Focusing on gender bias, the research investigates the calibration of probability-based confidence in contexts involving gendered pronoun resolution. The goal is to evaluate whether calibration metrics based on predicted confidence scores effectively capture fairness-related disparities in LLMs. The results show that, among the six state-of-the-art models evaluated, Gemma-2 demonstrates the worst calibration on the gender bias benchmark. The primary contribution of this work is a fairness-aware evaluation of LLM confidence calibration, offering guidance for ethical deployment. In addition, we introduce a new calibration metric, Gender-ECE, designed to measure gender disparities in resolution tasks.
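To make the notion of calibration concrete, the following is a minimal sketch of the standard expected calibration error (ECE) with equal-width confidence bins, plus a per-group comparison. The abstract does not define Gender-ECE, so the function `groupwise_ece_gap` and its max-minus-min gap are purely illustrative assumptions and should not be read as the paper's metric. The group labels and sample values are hypothetical.

```python
from collections import defaultdict


def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard ECE: bin-size-weighted mean of |accuracy - mean confidence|."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0
    bins = defaultdict(list)
    for conf, ok in zip(confidences, correct):
        # Equal-width bins over [0, 1]; a confidence of exactly 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for items in bins.values():
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += (len(items) / n) * abs(accuracy - avg_conf)
    return ece


def groupwise_ece_gap(confidences, correct, groups, n_bins=10):
    """Hypothetical disparity measure (an assumption, NOT the paper's Gender-ECE):
    ECE computed separately per group, plus the max-min gap between groups."""
    per_group = {}
    for g in set(groups):
        idx = [i for i, x in enumerate(groups) if x == g]
        per_group[g] = expected_calibration_error(
            [confidences[i] for i in idx], [correct[i] for i in idx], n_bins
        )
    gap = max(per_group.values()) - min(per_group.values())
    return per_group, gap


# Hypothetical pronoun-resolution predictions, grouped by the pronoun involved.
confidences = [0.9, 0.9, 0.8, 0.8]
correct = [True, True, True, False]
groups = ["she", "she", "he", "he"]
per_group, gap = groupwise_ece_gap(confidences, correct, groups)
print(per_group, gap)
```

A larger gap would indicate that the model's confidence is less trustworthy for one group than for another, which is the kind of disparity a fairness-aware calibration metric is meant to surface.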

AAAI 2026

The Confidence Trap: Gender Bias and Predictive Certainty in LLMs

ethics, bias, and fairness

technical paper

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.



Yaqing Wang’s research focuses on generalizing from a few
examples, aiming to build data-efficient, adaptive, and
explainable AI. Her early work established a unifying
framework for few-shot learning, which highlighted the
challenges of unreliable learning under sparse data and
articulated three canonical scenarios—scientific scarcity,
cold-start personalization, and annotation efficiency.
Building on this foundation, she has developed algorithms
addressing key real-world challenges: molecular property
prediction and drug–drug interaction under limited data in
drug discovery, recommendation models that overcome
cold-start issues and are deployed in large-scale
platforms, and efficient methods for intent recognition and
gesture sensing where annotation or interaction is costly.
Her recent work explores the synergy between meta-learning
and in-context learning, and introduces personalized agents
that adapt to user preferences with only a handful of
interactions. These contributions reflect her continued
efforts toward advancing few-shot learning in both theory
and practice, with growing impact in AI for science and
personalization.

From Few-Shot Learning to Data-Efficient Intelligence

Object state understanding aims at recognizing the co-occurrence and transitions of multiple object states in videos. While learning from videos handles seen object states well, it struggles with novel ones. We address this task in a zero-shot setting by extracting state-specific knowledge from pre-trained models and using Vision-Language Models (VLMs) to verify whether such knowledge is visually grounded in videos. However, the extracted knowledge varies in its ability to distinguish states, and VLM observations are not always trustworthy. To address this issue, we propose a trust-aware knowledge-guided method to model knowledge trustworthiness and emphasize highly discriminative knowledge that VLMs can reliably observe. Specifically, we collect spatial knowledge for each object state from retrieved images and cues generated from a Large Language Model, then use VLMs to vote on each knowledge element by scoring its visual consistency with the video. In addition to a single scene, temporal dependencies of object states across scenes are also captured using a generative VLM. Under spatial and temporal constraints, we propose an adaptive knowledge refinement module that iteratively updates knowledge reliability weights to achieve a global consensus in object state inference across the video. Finally, object states are inferred by combining the refined weights with VLM voting results. Experiments on two datasets demonstrate the effectiveness of our method.

What to Trust? A Trust-aware Knowledge-guided Method for Zero-shot Object State Understanding in Videos

Multimodal remote sensing image joint classification has achieved significant progress. However, existing methods primarily focus on designing modality-specific networks and lack adaptive generalization capabilities for the diverse and dynamic modality combinations encountered in real-world scenarios. Inspired by the generalization capabilities of visual foundation models in downstream tasks, we propose a unified Text-guided Arbitrary Modality Prompting (T-APT) framework, which leverages complementary fused features to drive the foundation model and employs text-guided modality-specific prior knowledge as cross-modal prompts to fine-tune a pretrained Vision Transformer (ViT) model. Specifically, a Mamba-Based Arbitrary Modal-Focused Feature Capture (MAMF-FC) module is designed to extract complementary joint features and modality-specific prior knowledge from arbitrary modalities through a shared-specific scanning encoder-decoder architecture. Subsequently, a Text-Guided Modality-Aware Prompt Tuning (TMPT) module is proposed to support the adaptation of fused features to the foundation model, enabling our arbitrary remote sensing image classification task. Extensive experiments on public datasets spanning multispectral (MS), hyperspectral (HS), light detection and ranging (LiDAR), and synthetic aperture radar (SAR) modalities demonstrate that our T-APT achieves classification performance comparable to specialized networks across arbitrary modal combinations.

T-APT: Text-Guided Modality-Aware Prompt Tuning for Arbitrary Multimodal Remote Sensing Data Joint Classification

Recent advances in large language models (LLMs) have broadened their applicability across diverse tasks, yet specialized domains still require targeted post-training. Among existing methods, Group Relative Policy Optimization (GRPO) stands out for its efficiency, leveraging groupwise relative rewards while avoiding costly value function learning. However, GRPO treats candidate responses as independent, overlooking semantic interactions such as complementarity and contradiction. To address this challenge, we first introduce a Structural Causal Model (SCM) that reveals hidden dependencies among candidate responses induced by conditioning on a final integrated output—forming a collider structure. Then, our causal analysis leads to two insights: (1) projecting responses onto a causally-informed subspace improves prediction quality, and (2) this projection yields a better baseline than query-only conditioning. Building on these insights, we propose Group Causal Policy Optimization (GCPO), which integrates causal structure into optimization through two key components: a causally-informed reward adjustment and a novel KL-regularization term that aligns the policy with a causally-projected reference distribution. Comprehensive experimental evaluations demonstrate that GCPO consistently surpasses existing methods—including GRPO—across multiple reasoning benchmarks.

Group Causal Policy Optimization for Post-Training Large Language Models
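As a point of reference for the groupwise relative rewards that the abstract above attributes to GRPO, here is a minimal sketch of how group-relative advantages are commonly computed: each candidate's reward is standardized against the other candidates sampled for the same query, so no learned value function is needed. This illustrates the GRPO baseline only, not the causal reward adjustment or KL term of GCPO, which the abstract does not specify. The function name, the use of the population standard deviation, and the `eps` value are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against its own sampling group.

    rewards: scalar rewards for the candidate responses to ONE query.
    eps: small constant (assumed value) to avoid division by zero
         when all rewards in the group are identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use the sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled responses to the same prompt.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Note that each advantage here depends on the other responses only through the group mean and spread. The abstract's criticism is that this ignores semantic interactions among candidates, such as complementarity and contradiction.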

Hyperedge prediction plays a central role in hypergraph learning, enabling the inference of high-order relations among multiple entities. However, existing methods often rely on a simplistic *flat set assumption*, treating candidate hyperedges as unstructured collections of nodes and neglecting their potential internal compositionality. Furthermore, the severe scarcity of observed hyperedges poses a challenge for effective supervision. In this work, we propose **S$^3$Hyper**, a **S**ubstructure-contextualized **S**elf-**S**upervised framework for **Hyper**edge prediction, which jointly addresses these two challenges. Specifically, we design a substructure-contextualized hyperedge aggregator that models the internal hierarchy of candidate hyperedges by leveraging sub-hyperedge information. In parallel, we introduce an adaptive tri-directional contrastive learning module that incorporates node-level, hyperedge-level, and cross-level alignment objectives, supported by temperature-adaptive mechanisms. Experimental results on four public datasets demonstrate that S$^3$Hyper consistently outperforms strong baselines, with ablation studies verifying the effectiveness of each component.

Self-Supervised Hypergraph Learning with Substructure Awareness for Hyperedge Prediction

Recent advancements in end-to-end autonomous driving systems (ADSs) underscore their potential for perception and planning. However, challenges remain. Complex driving scenarios contain rich semantic information, yet ambiguous or noisy semantics can compromise decision reliability, while interference between multiple driving tasks may hinder optimal planning. Furthermore, prolonged inference latency slows decision-making, increasing the risk of unsafe driving behaviors. To address these challenges, we propose ExpertAD, a novel framework that enhances the performance of ADSs with a Mixture of Experts (MoE) architecture. We introduce a Perception Adapter (PA) to amplify task-critical features, ensuring contextually relevant scene understanding, and a Mixture of Sparse Experts (MoSE) to minimize task interference during prediction, allowing for effective and efficient planning. Our experiments show that ExpertAD reduces average collision rates by up to 20% and inference latency by 25% compared to prior methods. We further evaluate its multi-skill planning capabilities in rare scenarios (e.g., accidents, yielding to emergency vehicles) and demonstrate strong generalization to unseen urban environments. Additionally, we present a case study that illustrates its decision-making process in complex driving scenarios. Code is included in the supplementary material.

ExpertAD: Enhancing Autonomous Driving Systems with Mixture of Experts

News image captioning aims to produce journalistically informative descriptions by combining visual content with contextual cues from associated articles. Despite recent advances, existing methods struggle with three key challenges: (1) incomplete information coverage, (2) weak cross-modal alignment, and (3) suboptimal visual-entity grounding. To address these issues, we introduce **MERGE**, the first **M**ultimodal **E**ntity-aware **R**etrieval-augmented **GE**neration framework for news image captioning. MERGE constructs an entity-centric multimodal knowledge base (EMKB) that integrates textual, visual, and structured knowledge, enabling enriched background retrieval. It improves cross-modal alignment through a multistage hypothesis-caption strategy and enhances visual-entity matching via dynamic retrieval guided by image content. Extensive experiments on GoodNews and NYTimes800k show that MERGE significantly outperforms state-of-the-art baselines, with CIDEr gains of +6.84 and +1.16 in caption quality, and F1-score improvements of +4.14 and +2.64 in named entity recognition. Notably, MERGE also generalizes well to the unseen Visual News dataset, achieving +20.17 in CIDEr and +6.22 in F1-score, demonstrating strong robustness and domain adaptability.

Knowledge Completes the Vision: A Multimodal Entity-aware Retrieval-Augmented Generation Framework for News Image Captioning

To efficiently solve exact discrete optimization problems, branch and bound algorithms require tight bounds. In constraint programming (CP), soft arc consistencies typically derive much stronger bounds for optimization than those offered by domain or bound consistencies applied to a cost variable. The reason is that soft local consistencies exchange marginal cost information between variables, whereas domain consistencies rely only on shrinking domains, which is less informative. However, CP solvers equipped with soft arc consistencies have so far offered limited support for efficient processing of global constraints.

In this work, we show how we can efficiently enforce soft local consistency over the AllDifferent constraint, relying on algorithms for the Linear Assignment problem (LAP). We implement this propagator in toulbar2, the state-of-the-art weighted CP solver exploiting soft local consistencies for bounding. On problems that include AllDifferent constraints, we show that, equipped with this new propagator, toulbar2 outperforms state-of-the-art domain consistency-based CP as well as integer programming solvers for the Quadratic Assignment Problem and shows better overall performance in the miniCOP track of the 2024 XCSP competition.

Assignment Problems in Cost Function Networks

Public health experts need scalable methods to monitor large volumes of health data (e.g., human-reported cases, hospitalizations, deaths). These methods must identify individual data points that may indicate significant events, such as outbreaks, or reveal data quality issues. Identifying, triaging, and analyzing these data points in real time is critical for preventing downstream errors in forecasting or policy. Traditional alert-based data monitoring systems, used for decades in practice, fail to identify relevant data events for several reasons. For example, these systems may not output real-time results from large data volumes, or they may return tens of thousands of unhelpful alerts.

We introduce a human-in-the-loop AI system for public health data monitoring that uses a ranking-based AI anomaly detection method. This system was developed through a multi-year interdisciplinary collaboration with participatory design from researchers, engineers, and public health data experts. From this process, we identified system goals, such as user control and efficiency, and designed a system that balances these goals. This system has since been deployed at a national public health organization and analyzes up to 5 million data points daily. A three-month longitudinal deployment evaluation revealed a significant improvement in system goals, including a 54x increase in data reviewer efficiency and increased engagement compared to traditional alert-based methods.

Reducing Alert Fatigue Through AI Ranking: A Deployed Public Health Data Monitoring System

Human trafficking, affecting over 50 million people globally, is a complex criminal enterprise in which traffickers actively conceal and distribute information across fragmented and often illicit online platforms. Traditional investigative tools are ill-suited for detecting patterns across such obfuscated, heterogeneous data. This paper presents Domain-specific Insight Graphs (DIG), an investigative AI search engine designed to operate at web scale and enable non-technical decision-makers, such as law enforcement and prosecutors, to rapidly uncover actionable leads in human trafficking investigations. DIG employs a novel AI pipeline that ingests large, diverse web corpora (including trafficking-relevant advertisements), cleans and normalizes extracted information, and links entities into a semantic knowledge graph. A domain-optimized search layer allows investigators to traverse these graphs to identify potential victims, perpetrators, and trafficking networks. Unlike commercial alternatives, DIG was released free of charge, open-sourced, and deployed to over 200 U.S. state and local law enforcement agencies through the DARPA Memex program. Deployment results demonstrate measurable impact: in New York, agencies using DIG reported a drop in sex worker arrests and an increase in trafficking-related arrests from <1% to over 60%, disrupting cycles of victim re-victimization. The system has been credited in high-profile prosecutions and received endorsements from District Attorneys. This paper details the problem context, AI approach, deployment process, operational challenges, and lessons learned from maintaining DIG post-federal funding, including navigating intellectual property for open release and sustaining the system via philanthropic support. DIG exemplifies how AI-driven investigative tools can deliver lasting societal benefit through targeted, innovative application in high-stakes domains.
