Classifiers trained on historical data are deployed in the real world to automate decisions from hiring to loan issuance. Judging the fairness and efficiency of these systems, and of their human counterparts, is a complex and important topic studied across both the computational and social sciences. One common way to address bias in classifiers is to resample the training data to offset distributional disparities. In the hiring domain, where results may vary by a protected class, many interventions from the literature equalize the hiring rate within the training set to alleviate bias in the resulting classifier. While simple and seemingly effective, these methods have typically been evaluated only on data obtained through convenience samples, e.g., the results of some real-world hiring process, introducing selection and label bias into the evaluation. In the social and health sciences, audit studies, in which fictitious ``testers'' (resumes) are sent to subjects (job openings) in a randomized controlled trial, provide high-quality data that support rigorous estimates of discrimination by controlling for confounding factors. In this paper, we investigate how data from audit studies can be used to improve our ability to both train and evaluate automated hiring algorithms. We find that audit data from real-world hiring reveal cases where equalizing base rates across classes \emph{appears} to achieve parity under traditional measures, but in fact leaves $\approx 10\%$ disparity when measured appropriately. We also show that corrections based on individual treatment effect estimation methods, combined with audit study data, can overcome these issues, underscoring the need for rigorous data collection in fairness research.
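The base-rate equalization intervention the abstract refers to can be sketched as follows. This is a minimal illustration, not the paper's method: the function name, the dict-based data layout, and the choice to oversample positives up to the highest observed group rate are all assumptions made for the example.

```python
import random

def equalize_base_rates(rows, group_key, label_key, seed=0):
    """Oversample positive (e.g. hired) examples within each protected-class
    group so that every group shares the highest observed positive rate.
    `rows` is a list of dicts; `group_key` and `label_key` name the fields."""
    rng = random.Random(seed)
    groups = {}
    for r in rows:
        groups.setdefault(r[group_key], []).append(r)
    # Target the highest positive rate seen across groups.
    target = max(
        sum(r[label_key] for r in g) / len(g) for g in groups.values()
    )
    out = []
    for g in groups.values():
        pos = [r for r in g if r[label_key]]
        neg = [r for r in g if not r[label_key]]
        out.extend(g)
        if pos and target < 1:
            # Solve pos' / (pos' + len(neg)) = target for the number of
            # positives this group needs, then duplicate at random.
            need = round(target * len(neg) / (1 - target)) - len(pos)
            out.extend(rng.choice(pos) for _ in range(max(0, need)))
    return out
```

A classifier retrained on the resampled output sees equal hiring rates across groups; the paper's point is that parity in this convenience-sampled training signal need not translate into parity measured against audit-study ground truth.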