Singapore

The use of Large Language Models (LLMs) in police operations is growing, yet an evaluation framework tailored to police operations remains absent. While LLMs' responses may not always be legally “incorrect”, their unverified use can still lead to severe issues such as unlawful arrests and improper evidence collection. To address this, we propose PAS (Police Action Scenarios), a systematic framework covering the entire evaluation process. Applying this framework, we constructed a novel QA dataset from over 8,000 official documents and established key metrics validated through statistical analysis with police expert judgements. Experimental results show that commercial LLMs struggle with our new police-related tasks, particularly in providing fact-based recommendations. This study highlights the necessity of an expandable evaluation framework to ensure reliable AI-driven police operations. We release our data and prompt template.

AAAI 2026

Evaluating LLMs for Police Decision-Making: A Framework Based on Police Action Scenarios

policing

policy and social development

large language models

evaluation

poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.


Large language models (LLMs) have achieved remarkable success in many domains, but concerns about data quality and privacy are growing. Federated Learning (FL) offers a privacy-preserving solution by training a model on local clients without sharing data. However, the impact of biased private data on LLMs fine-tuned through FL remains understudied. This work investigates how client-side biased data affects the global model during federated fine-tuning of LLMs. We simulate realistic scenarios in which some clients possess datasets containing social biases (stereotypes, discriminatory language) while others have clean data, and run extensive experiments with popular FL algorithms (FedAvg, FedAdam and FedProx) and popular LLMs (LLaMA, Mistral, Phi-3 and Gemma) across datasets with varying bias proportions (33%, 66%, 100%). Our findings reveal that 1) FedAdam consistently shows the lowest bias propagation, reducing CrowS-Pairs scores by up to 15% compared to FedAvg; 2) even small amounts of biased data (33%) can significantly influence global model bias; 3) mixed biased and neutral data distributions lead to 5-7% higher bias scores than segregated distributions. Additionally, we propose Bias-Aware Model Aggregation (BAMA), a novel debiasing method for federated fine-tuning that consistently reduces bias across various models and algorithms.

Investigating Social Bias Propagation in Federated Fine-tuning of Large Language Models
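
For readers unfamiliar with the aggregation step that the abstract above builds on, the following is a minimal sketch of standard FedAvg-style weighted averaging, in which each client's parameters are weighted by its local dataset size. It is not the paper's code and does not implement BAMA; the function name `fedavg` and the dictionary-of-lists parameter layout are assumptions chosen for illustration.

```python
from typing import Dict, List


def fedavg(
    client_params: List[Dict[str, List[float]]],
    client_sizes: List[int],
) -> Dict[str, List[float]]:
    """Average client parameters, weighting each client by its sample count.

    Hypothetical layout: every client holds the same parameter names, each
    mapped to a flat list of floats of identical length across clients.
    """
    if not client_params or len(client_params) != len(client_sizes):
        raise ValueError("need one dataset size per client")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total number of samples must be positive")

    aggregated: Dict[str, List[float]] = {}
    for name, first in client_params[0].items():
        aggregated[name] = [
            # Size-weighted mean of the i-th entry across all clients.
            sum(p[name][i] * n for p, n in zip(client_params, client_sizes)) / total
            for i in range(len(first))
        ]
    return aggregated


if __name__ == "__main__":
    clients = [{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}]
    # The second client holds three times as much data, so it dominates.
    print(fedavg(clients, [1, 3]))
```

Because the average is weighted only by dataset size, a client with biased data contributes to the global model in proportion to how much data it holds, which is one way to see why the proportion of biased data is varied in the experiments described above.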

Graph neural networks (GNNs) are widely used in urban spatiotemporal forecasting, e.g., predicting infrastructure problems. In this setting, government officials aim to identify in which neighborhoods incidents like potholes or rodents occur. The true state of incidents is observed via government inspection *ratings*. However, these ratings are only conducted for a sparse set of neighborhoods and incident types. We also observe the state of incidents via crowdsourced *reports*, which are more densely observed but may be biased due to heterogeneous reporting. First, we propose a multiview, multioutput GNN-based model that uses both unbiased rating data and biased reporting data to predict the true latent state of incidents. Second, we investigate a case study of New York City urban incidents and collect a dataset of 9,615,863 crowdsourced reports and 1,041,415 government inspection ratings over 3 years and across 139 types of incidents. We show on both real and semi-synthetic data that our model can better predict the latent state compared to models that use only reporting data or only rating data. Finally, we quantify demographic biases in crowdsourced reporting, e.g., higher-income neighborhoods report problems at higher rates. Our analysis showcases a widely applicable approach for latent state prediction using heterogeneous, sparse, and biased data.

Urban Incident Prediction with Graph Neural Networks: Integrating Government Ratings and Crowdsourced Reports

People who stutter (PWS) face systemic exclusion in today’s voice-driven society, where access to voice assistants, authentication systems, and remote work tools increasingly depends on fluent speech. Current automatic speech recognition (ASR) systems, trained predominantly on fluent speech, fail to serve millions of PWS worldwide. We present STEAMROLLER, a real-time system that transforms stuttered speech into fluent output through a novel multi-stage, multi-agent AI pipeline. Our approach addresses three critical technical challenges: (1) the difficulty of direct speech-to-speech conversion for disfluent input, (2) semantic distortions introduced during ASR transcription of stuttered speech, and (3) latency constraints for real-time communication. STEAMROLLER employs a three-stage architecture comprising ASR transcription, multi-agent text repair, and speech synthesis, where our core innovation lies in a collaborative multi-agent framework that iteratively refines transcripts while preserving semantic intent. Experiments on the FluencyBank dataset and a user study demonstrate clear word error rate (WER) reduction and strong user satisfaction. Beyond immediate accessibility benefits, fine-tuning ASR on STEAMROLLER-repaired speech yields additional WER improvements, creating a pathway toward inclusive AI ecosystems.

STEAMROLLER: A Multi-Agent System for Inclusive Automatic Speech Recognition for People Who Stutter
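
The abstract above reports results in terms of word error rate (WER). As a reminder of how that metric is usually defined, here is a minimal sketch that computes WER as the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions; this is not the authors' evaluation script, and the example sentences are invented.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # A transcript with two repeated words counts as two insertions.
    print(word_error_rate("i want to go home", "i i want to to go home"))
```

Word repetitions, a common feature of stuttered speech, show up as insertions under this definition, so a repair step that removes them lowers WER even when every intended word was recognized correctly.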

In applications across agriculture, ecology, and human development, machine learning with satellite imagery (SatML) is limited by the sparsity of labeled training data. While satellite data cover the globe, labeled training datasets for SatML are often small, spatially clustered, and collected for other purposes (e.g., administrative surveys or field measurements). Despite the pervasiveness of this issue in practice, past SatML research has largely focused on new model architectures and training algorithms to handle scarce training data, rather than modeling data conditions directly. This leaves scientists and policymakers who wish to use SatML for large-scale monitoring uncertain about whether and how to collect additional data to maximize performance. Here, we present the first problem formulation for the optimization of spatial training data in the presence of heterogeneous data collection costs and realistic budget constraints, as well as novel methods for addressing this problem. In experiments simulating different problem settings across three continents and four tasks, our strategies reveal substantial gains from sample optimization. Further experiments delineate settings for which optimized sampling is particularly effective. The problem formulation and methods we introduce are designed to generalize across application domains for SatML; we put special emphasis on a specific problem setting where our coauthors can immediately use our findings to augment clustered agricultural surveys for SatML monitoring in Togo.

Mapping on a Budget: Optimizing Spatial Data Collection for ML

Electric vehicles (EVs) are essential for sustainable mobility and combating climate change. EV performance heavily relies on lithium-ion batteries (LIBs), which degrade over time, reducing driving range and increasing maintenance costs. Prolonged exposure to high states of charge (SOC) accelerates battery degradation, which can be mitigated by delayed-full charging. However, successful implementation of delayed-full charging requires accurate predictions of user departure times to ensure vehicles reach full charge precisely before use. In this work, we propose a Transformer-based real-time-to-event (TTE) model for accurate EV departure prediction. Our approach models each day as a TTE sequence by discretizing the timeline into grids, which are represented as tokens. Unlike previous methods that depend primarily on temporal patterns in historical data, our method leverages streaming contextual behavioral and environmental information to predict departures. Evaluation on a real-world study involving 93 users and passive smartphone data demonstrates that our method effectively captures irregular departure patterns within individual routines, significantly outperforming baseline models. Personalized fine-tuning further improves prediction accuracy, highlighting our approach’s potential for practical deployment of the delayed-full charging algorithm and its contribution to sustainable transportation systems.

Enabling Delayed-Full Charging Through Transformer-Based Real-Time-to-Departure Modeling for EV Battery Longevity
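
To make the idea of discretizing a day into grid cells more concrete, the sketch below shows one possible way to turn a single departure time into a per-cell time-to-event label sequence. The abstract does not specify the grid size or label encoding, so the 30-minute grid, the `-1` marker for cells after departure, and the function name `time_to_departure_labels` are all assumptions made for illustration only.

```python
from typing import List

MINUTES_PER_DAY = 24 * 60


def time_to_departure_labels(departure_minute: int, grid_minutes: int = 30) -> List[int]:
    """Label each grid cell of a day with the number of cells left until departure.

    Cells after the departure cell get -1 (a hypothetical "already departed"
    marker; the paper's actual encoding may differ).
    """
    if not 0 <= departure_minute < MINUTES_PER_DAY:
        raise ValueError("departure_minute must fall within one day")
    if grid_minutes <= 0 or MINUTES_PER_DAY % grid_minutes != 0:
        raise ValueError("grid_minutes must evenly divide a day")

    num_cells = MINUTES_PER_DAY // grid_minutes
    departure_cell = departure_minute // grid_minutes
    return [
        departure_cell - cell if cell <= departure_cell else -1
        for cell in range(num_cells)
    ]


if __name__ == "__main__":
    labels = time_to_departure_labels(8 * 60 + 15)  # departure at 08:15
    print(len(labels), labels[0], labels[16], labels[17])
```

In a setup like this, a sequence model would see one token per grid cell together with the contextual features available at that time, and would be trained to predict the remaining-cells label as the day unfolds.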

As corporate responsibility increasingly incorporates environmental, social, and governance (ESG) criteria, ESG reporting is becoming a legal obligation in many regions. These reports serve as a primary mechanism for organizations to document sustainability practices and for stakeholders to evaluate long-term viability and ethical performance. Ensuring regulatory compliance demands disclosures that are accurate, transparent, and verifiable. However, the complexity and scale of ESG disclosures present challenges for interpretation and automated analysis. To facilitate scalable and trustworthy analysis of these reports, this paper introduces ESG-Bench, a novel benchmark dataset aimed at advancing research in ESG report understanding and hallucination mitigation for large language models (LLMs). ESG-Bench consists of human-annotated question–answer (QA) pairs grounded in real-world ESG report contexts, along with fine-grained labels indicating whether model responses are factually supported or hallucinated. By framing ESG report analysis as a QA task with verifiability constraints, ESG-Bench enables systematic evaluation of LLMs' ability to extract and reason over ESG content.
We also uncover a previously unexplored use case: applying ESG-Bench to mitigate hallucinations in socially sensitive and compliance-critical contexts. To this end, we design task-specific Chain-of-Thought (CoT) prompting strategies and fine-tune multiple state-of-the-art LLMs on ESG-Bench using CoT-annotated rationales. Experimental results demonstrate that these CoT-based strategies substantially outperform standard prompting and direct fine-tuning, effectively mitigating hallucinations across benchmarks and highlighting the unique challenges of long-context document reasoning in the ESG setting. We also evaluate our approach across existing QA benchmarks to assess generalization beyond the ESG domain.

ESG-Bench: Benchmarking Long-Context ESG Reports for Hallucination Mitigation

Understanding the complex host-seeking behavior of disease vectors such as mosquitoes is critical for predicting disease transmission and vector control. This behavior arises from a dynamic interplay between multi-modal sensory cues and internal behavioral states, a process ill-suited for traditional ODE frameworks due to its inherent stochasticity and discrete, state-based nature. We introduce the Behavioral State Attention Network (BSAN), a deep learning architecture designed to model the underlying sensorimotor computations of this behavior. BSAN utilizes a recurrent neural network (RNN) with an LSTM core to process temporal sequences, incorporating a variational encoder to capture the randomness of flight paths and a Mixture Density Network (MDN) to predict multi-modal velocity distributions. The architecture explicitly models distinct behavioral states, such as $CO_2$ plume tracking and thermal approach, through a Mixture-of-Experts (MoE) framework, and learns to interpretably integrate olfactory, thermal, and visual inputs using a cross-modal attention mechanism. The network generates realistic flight trajectories that exhibit emergent host-seeking behaviors. By providing both trajectory predictions and interpretable behavioral primitives, BSAN serves as a framework for downstream applications in landscape genomics and vector control, enabling the prediction of mosquito population connectivity through environment-specific movement kernels.

BSAN: Behavioral State Attention Network for Modeling Mosquito Host-Seeking Behavior

Understanding human attitudes, preferences, and behaviors through social surveys is essential for academic research and policymaking. Yet traditional surveys face persistent challenges, including fixed-question formats, high costs, limited adaptability, and difficulties ensuring cross-cultural equivalence. While recent studies explore large language models (LLMs) to simulate survey responses, most are limited to structured questions, overlook the entire survey process, and risk under-representing marginalized groups due to training data biases. We introduce AlignSurvey, the first benchmark that systematically replicates and evaluates the full social survey pipeline using LLMs. It defines four tasks aligned with key survey stages: social role modeling, semi-structured interview modeling, attitude stance modeling and survey response modeling. It also provides task-specific evaluation metrics to assess alignment fidelity, consistency, and fairness at both individual and group levels, with a focus on demographic diversity. To support AlignSurvey, we construct a multi-tiered dataset architecture: (i) the Social Foundation Corpus, a cross-national resource with 44K+ interview dialogues and 400K+ structured survey records; and (ii) a suite of Entire-Pipeline Survey Datasets, including the expert-annotated AlignSurvey-Expert (ASE) and two nationally representative surveys for cross-cultural evaluation. We release the SurveyLM family, obtained through two-stage fine-tuning of open-source LLMs, and offer reference models for evaluating domain-specific alignment. All datasets, models, and tools are available on GitHub and Hugging Face to support transparent and socially responsible research.

AlignSurvey: A Comprehensive Benchmark for Human Preferences Alignment in Social Surveys

Image classification systems often inherit biases from uneven group representation in training data. For example, in face datasets for hair color classification, blond hair may be disproportionately associated with females, reinforcing stereotypes. A recent approach leverages the Stable Diffusion model to generate balanced training data, but these models often struggle to preserve the original data distribution. In this work, we explore multiple diffusion-finetuning techniques, e.g., LoRA and DreamBooth, to generate images that more accurately represent each training group by learning directly from their samples. Additionally, in order to prevent a single DreamBooth model from being overwhelmed by excessive intra-group variations, we explore a technique of clustering images within each group and train a DreamBooth model per cluster. These models are then used to generate group-balanced data for pretraining, followed by fine-tuning on real data. Experiments on multiple benchmarks demonstrate that the studied finetuning approaches outperform vanilla Stable Diffusion on average and achieve results comparable to SOTA debiasing techniques like Group-DRO, while surpassing them as the dataset bias severity increases. Code will be made public upon acceptance.

Harnessing Diffusion-Generated Synthetic Images for Fair Image Classification

Personalized insulin therapy for individuals with Type 1 Diabetes via closed‑loop artificial pancreas systems requires rapid adaptation of dosing strategies to each patient's unique insulin response. However, learning patient‑specific policies from scratch demands extensive exploration, which is often impractical. In this work, we study a framework that integrates insulin-response-informed transfer learning with model-based reinforcement learning for insulin dosing. We first train an LSTM‑based insulin responsiveness predictor on virtual patients, using their glucose, insulin, and meal history to forecast future glucose levels. Analysis of insulin responsiveness of in-silico patients uncovers natural insulin‑response groups characterized by similar sensitivity and dynamics profiles. For a new patient, we identify a representative model from their response group and use it to generate synthetic trajectories. These trajectories are integrated into an enhanced H-step Deep Dyna-Q algorithm, enabling accelerated policy optimization through model-based planning. The dynamics model trained entirely in simulation achieves 91.31% accuracy in predicting blood glucose ranges on the Ohio Type 1 Diabetes dataset, indicating strong zero-shot generalization. Additionally, we find that bootstrapping a new patient with a physiologically-matched reference model accelerates convergence of effective dosing policies across in-silico cohorts of children, adolescents, and adults. These findings suggest that leveraging response-group-specific synthetic experience can expedite personalized insulin therapy, offering a promising pathway towards clinical validation.
