Singapore

Ensuring fairness in machine learning requires understanding how sensitive attributes like race or gender causally influence outcomes. Existing causal discovery (CD) methods often struggle to recover fairness-relevant pathways in the presence of noise, confounding, or data corruption. Large language models (LLMs) offer a complementary signal by leveraging semantic priors from variable metadata. We propose a hybrid LLM-guided CD framework that extends a breadth-first search strategy with active learning and dynamic scoring. Variable pairs are prioritized for querying using a composite score combining mutual information, partial correlation, and LLM confidence, enabling more efficient and robust structure discovery. To evaluate fairness sensitivity, we introduce a semi-synthetic benchmark based on the UCI Adult dataset, embedding domain-informed bias pathways alongside noise and latent confounders. We assess how well CD methods recover both global graph structure and fairness-critical paths (e.g., sex→education→income). Our results show that LLM-guided methods—including our active, dynamically scored variant—outperform baselines in recovering fairness-relevant structure under noisy conditions. We analyze when LLM-driven insights complement statistical dependencies and discuss implications for fairness auditing in high-stakes domains.
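The composite pair-prioritization score described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the equal weights, the histogram estimator for mutual information, the precision-matrix estimate of partial correlation, the default confidence of 0.5, and the `llm_confidence` values are all assumptions introduced here, and the synthetic data only mimics a sex→education→income chain.

```python
import itertools
import numpy as np

def mutual_information(x, y, bins=8):
    """Histogram estimate of mutual information (in nats) between two 1-D arrays."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)   # marginal of y, shape (1, bins)
    mask = pxy > 0
    return float((pxy[mask] * np.log(pxy[mask] / (px @ py)[mask])).sum())

def partial_correlations(data):
    """Absolute partial correlation of each pair given all remaining columns."""
    precision = np.linalg.pinv(np.cov(data, rowvar=False))
    d = np.sqrt(np.diag(precision))
    pcorr = -precision / np.outer(d, d)
    np.fill_diagonal(pcorr, 1.0)
    return np.abs(pcorr)

def rank_pairs(data, names, llm_confidence, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Rank variable pairs by a weighted sum of MI, partial correlation, and LLM confidence.

    The weights and the 0.5 default confidence are hypothetical choices.
    """
    pcorr = partial_correlations(data)
    pairs = list(itertools.combinations(range(len(names)), 2))
    mi = np.array([mutual_information(data[:, i], data[:, j]) for i, j in pairs])
    if mi.max() > 0:
        mi = mi / mi.max()                # rescale MI to [0, 1]
    scored = []
    for (i, j), m in zip(pairs, mi):
        conf = llm_confidence.get((names[i], names[j]),
                                  llm_confidence.get((names[j], names[i]), 0.5))
        score = weights[0] * m + weights[1] * pcorr[i, j] + weights[2] * conf
        scored.append((names[i], names[j], float(score)))
    return sorted(scored, key=lambda t: t[2], reverse=True)

# Synthetic chain: sex -> education -> income, plus an unrelated noise column.
rng = np.random.default_rng(0)
sex = rng.normal(size=2000)
education = sex + 0.5 * rng.normal(size=2000)
income = education + 0.5 * rng.normal(size=2000)
noise = rng.normal(size=2000)
data = np.column_stack([sex, education, income, noise])
names = ["sex", "education", "income", "noise"]

# Made-up confidences standing in for LLM answers about each pair.
llm_confidence = {("sex", "education"): 0.9,
                  ("education", "income"): 0.9,
                  ("sex", "income"): 0.6}
ranking = rank_pairs(data, names, llm_confidence)
```

In this sketch the pairs at the top of `ranking` would be queried first. The sex and income pair shows high mutual information but a partial correlation near zero once education is conditioned on, which is the kind of disagreement between signals a composite score is meant to weigh.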

AAAI 2026

Uncovering Bias Paths with LLM-guided Causal Discovery: An Active Learning and Dynamic Scoring Approach

interpretability & explainability

and fairness

(large) language models

causal learning

ethics

accountability

bias

active learning

poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.


The use of Large Language Models (LLMs) in police operations is growing, yet an evaluation framework tailored to police operations remains absent. While LLM’s responses may not always be legally “incorrect”, their unverified use still can lead to severe issues such as unlawful arrests and improper evidence collection. To address this, we propose PAS (Police Action Scenarios), a systematic framework covering the entire evaluation process. Applying this framework, we constructed a novel QA dataset from over 8,000 official documents and established key metrics validated through statistical analysis with police expert judgements. Experimental results show that commercial LLMs struggle with our new police-related tasks, particularly in providing fact-based recommendations. This study highlights the necessity of an expandable evaluation framework to ensure reliable AI-driven police operations. We release our data and prompt template.

Evaluating LLMs for Police Decision-Making: A Framework Based on Police Action Scenarios

Large language models (LLMs) have achieved remarkable success in many domains, but concerns about data quality and privacy are growing. Federated Learning (FL) offers a privacy-preserving solution by training a model on local clients without sharing data. However, the impact of biased private data on LLMs fine-tuned through FL remains understudied. This work investigates how client-side biased data affects the global model during federated fine-tuning of LLMs. We simulate realistic scenarios where some clients possess datasets containing social biases (stereotypes, discriminatory language) while others have clean data. We run extensive experiments with popular FL algorithms (FedAvg, FedAdam and FedProx) and popular LLMs (LLaMA, Mistral, Phi-3 and Gemma) across datasets with varying bias proportions (33%, 66%, 100%). Our findings reveal that 1) FedAdam consistently shows the lowest bias propagation, reducing CrowS-Pairs scores by up to 15% compared to FedAvg; 2) even small amounts of biased data (33%) can significantly influence global model bias; 3) mixed biased and neutral data distributions lead to 5-7% higher bias scores than segregated distributions. Additionally, we propose Bias-Aware Model Aggregation (BAMA), a novel debiasing method for federated fine-tuning that consistently reduces bias across various models and algorithms.

Investigating Social Bias Propagation in Federated Fine-tuning of Large Language Models
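For readers unfamiliar with FedAvg, the server-side aggregation step it is named for can be sketched as a sample-size-weighted average of client parameters. This is a minimal sketch of that standard step only; the parameter name `"w"`, the client sizes, and the values are made up for illustration, and nothing here reflects the paper's experimental setup or its BAMA method.

```python
import numpy as np

def fedavg(client_params, client_sizes):
    """Average client parameter dicts, weighting each client by its sample count."""
    total = float(sum(client_sizes))
    averaged = {}
    for key in client_params[0]:
        averaged[key] = sum(params[key] * (size / total)
                            for params, size in zip(client_params, client_sizes))
    return averaged

# Two hypothetical clients; the second holds twice as much data as the first.
client_params = [{"w": np.array([0.0, 0.0])}, {"w": np.array([3.0, 3.0])}]
client_sizes = [1, 2]
global_params = fedavg(client_params, client_sizes)
```

Because raw data never leaves a client, whatever a client's data contributes, including bias, can only reach the global model through parameter updates like these.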

Graph neural networks (GNNs) are widely used in urban spatiotemporal forecasting, e.g., predicting infrastructure problems. In this setting, government officials aim to identify in which neighborhoods incidents like potholes or rodents occur. The true state of incidents is observed via government inspection *ratings*. However, these ratings are only conducted for a sparse set of neighborhoods and incident types. We also observe the state of incidents via crowdsourced *reports*, which are more densely observed but may be biased due to heterogeneous reporting. First, we propose a multiview, multioutput GNN-based model that uses both unbiased rating data and biased reporting data to predict the true latent state of incidents. Second, we investigate a case study of New York City urban incidents and collect a dataset of 9,615,863 crowdsourced reports and 1,041,415 government inspection ratings over 3 years and across 139 types of incidents. We show on both real and semi-synthetic data that our model can better predict the latent state compared to models that use only reporting data or only rating data. Finally, we quantify demographic biases in crowdsourced reporting, e.g., higher-income neighborhoods report problems at higher rates. Our analysis showcases a widely applicable approach for latent state prediction using heterogeneous, sparse, and biased data.

Urban Incident Prediction with Graph Neural Networks: Integrating Government Ratings and Crowdsourced Reports

People who stutter (PWS) face systemic exclusion in today’s voice-driven society, where access to voice assistants, authentication systems, and remote work tools increasingly depends on fluent speech. Current automatic speech recognition (ASR) systems, trained predominantly on fluent speech, fail to serve millions of PWS worldwide. We present STEAMROLLER, a real-time system that transforms stuttered speech into fluent output through a novel multi-stage, multi-agent AI pipeline. Our approach addresses three critical technical challenges: (1) the difficulty of direct speech-to-speech conversion for disfluent input, (2) semantic distortions introduced during ASR transcription of stuttered speech, and (3) latency constraints for real-time communication. STEAMROLLER employs a three-stage architecture comprising ASR transcription, multi-agent text repair, and speech synthesis, where our core innovation lies in a collaborative multi-agent framework that iteratively refines transcripts while preserving semantic intent. Experiments on the FluencyBank dataset and a user study demonstrate clear word error rate (WER) reduction and strong user satisfaction. Beyond immediate accessibility benefits, fine-tuning ASR on STEAMROLLER-repaired speech yields additional WER improvements, creating a pathway toward inclusive AI ecosystems.

STEAMROLLER: A Multi-Agent System for Inclusive Automatic Speech Recognition for People Who Stutter
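Word error rate, the metric reported above, is conventionally the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. The sketch below computes it with a standard dynamic program; the example sentences are invented for illustration and are not drawn from FluencyBank or from the paper.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference prefix so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)

# Invented example: a transcript with repeated words versus a repaired transcript.
intended = "i want to go home"
raw_wer = word_error_rate(intended, "i i want to to go home")
repaired_wer = word_error_rate(intended, "i want to go home")
```

Repeated words in a raw transcript count as insertions, so a text-repair stage that removes them lowers the score without changing the intended words.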

In applications across agriculture, ecology, and human development, machine learning with satellite imagery (SatML) is limited by the sparsity of labeled training data. While satellite data cover the globe, labeled training datasets for SatML are often small, spatially clustered, and collected for other purposes (e.g., administrative surveys or field measurements). Despite the pervasiveness of this issue in practice, past SatML research has largely focused on new model architectures and training algorithms to handle scarce training data, rather than modeling data conditions directly. This leaves scientists and policymakers who wish to use SatML for large-scale monitoring uncertain about whether and how to collect additional data to maximize performance. Here, we present the first problem formulation for the optimization of spatial training data in the presence of heterogeneous data collection costs and realistic budget constraints, as well as novel methods for addressing this problem. In experiments simulating different problem settings across three continents and four tasks, our strategies reveal substantial gains from sample optimization. Further experiments delineate settings for which optimized sampling is particularly effective. The problem formulation and methods we introduce are designed to generalize across application domains for SatML; we put special emphasis on a specific problem setting where our coauthors can immediately use our findings to augment clustered agricultural surveys for SatML monitoring in Togo.

Mapping on a Budget: Optimizing Spatial Data Collection for ML

Electric vehicles (EVs) are essential for sustainable mobility and combating climate change. EV performance heavily relies on lithium-ion batteries (LIBs), which degrade over time, reducing driving range and increasing maintenance costs. Prolonged exposure to high states of charge (SOC) accelerates battery degradation, which can be mitigated by delayed-full charging. However, successful implementation of delayed-full charging requires accurate predictions of user departure times to ensure vehicles reach full charge precisely before use. In this work, we propose a Transformer-based real-time-to-event (TTE) model for accurate EV departure prediction. Our approach models each day as a TTE sequence by discretizing the timeline into grids, which are represented as tokens. Unlike previous methods that depend primarily on temporal dependencies in historical patterns, our method leverages streaming contextual behavioral and environmental information to predict departures. Evaluation on a real-world study involving 93 users and passive smartphone data demonstrates that our method effectively captures irregular departure patterns within individual routines, significantly outperforming baseline models. Personalized fine-tuning further improves prediction accuracy, highlighting our approach’s potential for practical deployment of the delayed-full charging algorithm and its contribution to sustainable transportation systems.

Enabling Delayed-Full Charging Through Transformer-Based Real-Time-to-Departure Modeling for EV Battery Longevity
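One way to picture the "day as a TTE sequence" idea in the abstract above is to cut the day into fixed-width cells and label each cell with the number of cells remaining until departure. The abstract does not specify the grid width or the token vocabulary, so everything below is an assumption: the function name, the 30-minute default, and the `-1` marker for cells after the departure are hypothetical choices made only for illustration.

```python
def day_to_tte_tokens(departure_minute, grid_minutes=30):
    """Label each grid cell of a day with the cells remaining until departure.

    departure_minute: minutes after midnight at which the user leaves.
    Cells after the departure get -1 (a hypothetical 'no upcoming event' marker).
    """
    if not 0 <= departure_minute < 24 * 60:
        raise ValueError("departure_minute must fall within one day")
    n_cells = (24 * 60) // grid_minutes
    departure_cell = departure_minute // grid_minutes
    tokens = []
    for cell in range(n_cells):
        if cell <= departure_cell:
            tokens.append(departure_cell - cell)  # cells left until departure
        else:
            tokens.append(-1)                     # departure already happened
    return tokens

# Hypothetical day: departure at 08:10, hourly grid.
tokens = day_to_tte_tokens(8 * 60 + 10, grid_minutes=60)
```

A sequence model could then be trained to predict the token for the current cell from streaming context, which is the role the abstract assigns to the Transformer.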

As corporate responsibility increasingly incorporates environmental, social, and governance (ESG) criteria, ESG reporting is becoming a legal obligation in many regions. These reports serve as a primary mechanism for organizations to document sustainability practices and for stakeholders to evaluate long-term viability and ethical performance. Ensuring regulatory compliance demands disclosures that are accurate, transparent, and verifiable. However, the complexity and scale of ESG disclosures present challenges for interpretation and automated analysis. To facilitate scalable and trustworthy analysis of these reports, this paper introduces ESG-Bench, a novel benchmark dataset aimed at advancing research in ESG report understanding and hallucination mitigation for large language models (LLMs). ESG-Bench consists of human-annotated question–answer (QA) pairs grounded in real-world ESG report contexts, along with fine-grained labels indicating whether model responses are factually supported or hallucinated. By framing ESG report analysis as a QA task with verifiability constraints, ESG-Bench enables systematic evaluation of LLMs' ability to extract and reason over ESG content.
We also uncover a previously unexplored use case: applying ESG-Bench to mitigate hallucinations in socially sensitive and compliance-critical contexts. To this end, we design task-specific Chain-of-Thought (CoT) prompting strategies and fine-tune multiple state-of-the-art LLMs on ESG-Bench using CoT-annotated rationales. Experimental results demonstrate that these CoT-based strategies substantially outperform standard prompting and direct fine-tuning, effectively mitigating hallucinations across benchmarks and highlighting the unique challenges of long-context document reasoning in the ESG setting. We also evaluate our approach across existing QA benchmarks to assess generalization beyond the ESG domain.

ESG-Bench: Benchmarking Long-Context ESG Reports for Hallucination Mitigation

Understanding the complex host-seeking behavior of disease vectors such as mosquitoes is critical for predicting disease transmission and for vector control. This behavior arises from a dynamic interplay between multi-modal sensory cues and internal behavioral states, a process ill-suited for traditional ODE frameworks due to its inherent stochasticity and discrete, state-based nature. We introduce the Behavioral State Attention Network (BSAN), a deep learning architecture designed to model the underlying sensorimotor computations of this behavior. BSAN utilizes a recurrent neural network (RNN) with an LSTM core to process temporal sequences, incorporating a variational encoder to capture the randomness of flight paths and a Mixture Density Network (MDN) to predict multi-modal velocity distributions. The architecture explicitly models distinct behavioral states, such as $\mathrm{CO}_2$ plume tracking and thermal approach, through a Mixture-of-Experts (MoE) framework, and learns to interpretably integrate olfactory, thermal, and visual inputs using a cross-modal attention mechanism. The network generates realistic flight trajectories that exhibit emergent host-seeking behaviors. By providing both trajectory predictions and interpretable behavioral primitives, BSAN serves as a framework for downstream applications in landscape genomics and vector control, enabling the prediction of mosquito population connectivity through environment-specific movement kernels.

BSAN: Behavioral State Attention Network for Modeling Mosquito Host-Seeking Behavior

Understanding human attitudes, preferences, and behaviors through social surveys is essential for academic research and policymaking. Yet traditional surveys face persistent challenges, including fixed-question formats, high costs, limited adaptability, and difficulties ensuring cross-cultural equivalence. While recent studies explore large language models (LLMs) to simulate survey responses, most are limited to structured questions, overlook the entire survey process, and risk under-representing marginalized groups due to training data biases. We introduce AlignSurvey, the first benchmark that systematically replicates and evaluates the full social survey pipeline using LLMs. It defines four tasks aligned with key survey stages: social role modeling, semi-structured interview modeling, attitude stance modeling, and survey response modeling. It also provides task-specific evaluation metrics to assess alignment fidelity, consistency, and fairness at both individual and group levels, with a focus on demographic diversity. To support AlignSurvey, we construct a multi-tiered dataset architecture: (i) the Social Foundation Corpus, a cross-national resource with 44K+ interview dialogues and 400K+ structured survey records; and (ii) a suite of Entire-Pipeline Survey Datasets, including the expert-annotated AlignSurvey-Expert (ASE) and two nationally representative surveys for cross-cultural evaluation. We release the SurveyLM family, obtained through two-stage fine-tuning of open-source LLMs, and offer reference models for evaluating domain-specific alignment. All datasets, models, and tools are available on GitHub and Hugging Face to support transparent and socially responsible research.

AlignSurvey: A Comprehensive Benchmark for Human Preferences Alignment in Social Surveys

Image classification systems often inherit biases from uneven group representation in training data. For example, in face datasets for hair color classification, blond hair may be disproportionately associated with females, reinforcing stereotypes. A recent approach leverages the Stable Diffusion model to generate balanced training data, but these models often struggle to preserve the original data distribution. In this work, we explore multiple diffusion-finetuning techniques, e.g., LoRA and DreamBooth, to generate images that more accurately represent each training group by learning directly from their samples. Additionally, in order to prevent a single DreamBooth model from being overwhelmed by excessive intra-group variations, we explore a technique of clustering images within each group and train a DreamBooth model per cluster. These models are then used to generate group-balanced data for pretraining, followed by fine-tuning on real data. Experiments on multiple benchmarks demonstrate that the studied finetuning approaches outperform vanilla Stable Diffusion on average and achieve results comparable to SOTA debiasing techniques like Group-DRO, while surpassing them as the dataset bias severity increases. Code will be made public upon acceptance.
