Singapore

As corporate responsibility increasingly incorporates environmental, social, and governance (ESG) criteria, ESG reporting is becoming a legal obligation in many regions. These reports serve as a primary mechanism for organizations to document sustainability practices and for stakeholders to evaluate long-term viability and ethical performance. Ensuring regulatory compliance demands disclosures that are accurate, transparent, and verifiable. However, the complexity and scale of ESG disclosures present challenges for interpretation and automated analysis. To facilitate scalable and trustworthy analysis of these reports, this paper introduces ESG-Bench, a novel benchmark dataset aimed at advancing research in ESG report understanding and hallucination mitigation for large language models (LLMs). ESG-Bench consists of human-annotated question–answer (QA) pairs grounded in real-world ESG report contexts, along with fine-grained labels indicating whether model responses are factually supported or hallucinated. By framing ESG report analysis as a QA task with verifiability constraints, ESG-Bench enables systematic evaluation of LLMs' ability to extract and reason over ESG content.
We also uncover a previously unexplored use case: applying ESG-Bench to mitigate hallucinations in socially sensitive and compliance-critical contexts. To this end, we design task-specific Chain-of-Thought (CoT) prompting strategies and fine-tune multiple state-of-the-art LLMs on ESG-Bench using CoT-annotated rationales. Experimental results demonstrate that these CoT-based strategies substantially outperform standard prompting and direct fine-tuning, effectively mitigating hallucinations across benchmarks and highlighting the unique challenges of long-context document reasoning in the ESG setting. We also evaluate our approach across existing QA benchmarks to assess generalization beyond the ESG domain.
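
As a rough illustration of the setup described in this abstract, the sketch below shows how a hallucination rate could be computed over labeled QA records and how a CoT-style prompt might be assembled. The record fields, the label strings, and the prompt wording are all assumptions made for illustration; the abstract does not specify the ESG-Bench schema or the authors' actual prompts.

```python
from dataclasses import dataclass


@dataclass
class QARecord:
    """Hypothetical record layout; not the published ESG-Bench schema."""
    question: str
    context: str
    model_answer: str
    label: str  # assumed values: "supported" or "hallucinated"


def hallucination_rate(records):
    """Fraction of model answers labeled as hallucinated."""
    if not records:
        raise ValueError("need at least one record")
    flagged = sum(1 for r in records if r.label == "hallucinated")
    return flagged / len(records)


def build_cot_prompt(question, context):
    """Assemble a step-by-step prompt (illustrative wording only)."""
    return (
        "Report excerpt:\n" + context + "\n\n"
        "Question: " + question + "\n"
        "Step 1: Quote the sentences in the excerpt that bear on the question.\n"
        "Step 2: Reason using only those quotes.\n"
        "Step 3: Answer, or state that the excerpt does not contain the answer.\n"
    )


# Toy records, invented for this example.
records = [
    QARecord("What is the 2030 emissions target?", "...", "A 40% cut.", "supported"),
    QARecord("Who audits the report?", "...", "An external firm.", "hallucinated"),
]
print(hallucination_rate(records))
```

Asking the model to quote evidence before answering is one plausible way to tie answers to the report text; whether it matches the paper's prompting strategy is not stated in the abstract.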

AAAI 2026

ESG-Bench: Benchmarking Long-Context ESG Reports for Hallucination Mitigation

nlp: fact-checking / misinformation detection (nlp focus)

nlp: question answering

nlp: safety and robustness

nlp: (large) language models

poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held at Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.

Understanding the complex host-seeking behavior of disease vectors such as mosquitoes is critical for predicting disease transmission and for vector control. This behavior arises from a dynamic interplay between multi-modal sensory cues and internal behavioral states, a process ill-suited for traditional ODE frameworks due to its inherent stochasticity and discrete, state-based nature. We introduce the Behavioral State Attention Network (BSAN), a deep learning architecture designed to model the underlying sensorimotor computations of this behavior. BSAN utilizes a recurrent neural network (RNN) with an LSTM core to process temporal sequences, incorporating a variational encoder to capture the randomness of flight paths and a Mixture Density Network (MDN) to predict multi-modal velocity distributions. The architecture explicitly models distinct behavioral states, such as $CO_2$ plume tracking and thermal approach, through a Mixture-of-Experts (MoE) framework, and learns to interpretably integrate olfactory, thermal, and visual inputs using a cross-modal attention mechanism. The network generates realistic flight trajectories that exhibit emergent host-seeking behaviors. By providing both trajectory predictions and interpretable behavioral primitives, BSAN serves as a framework for downstream applications in landscape genomics and vector control, enabling the prediction of mosquito population connectivity through environment-specific movement kernels.

BSAN: Behavioral State Attention Network for Modeling Mosquito Host-Seeking Behavior

Understanding human attitudes, preferences, and behaviors through social surveys is essential for academic research and policymaking. Yet traditional surveys face persistent challenges, including fixed-question formats, high costs, limited adaptability, and difficulties ensuring cross-cultural equivalence. While recent studies explore large language models (LLMs) to simulate survey responses, most are limited to structured questions, overlook the entire survey process, and risk under-representing marginalized groups due to training data biases. We introduce AlignSurvey, the first benchmark that systematically replicates and evaluates the full social survey pipeline using LLMs. It defines four tasks aligned with key survey stages: social role modeling, semi-structured interview modeling, attitude stance modeling, and survey response modeling. It also provides task-specific evaluation metrics to assess alignment fidelity, consistency, and fairness at both individual and group levels, with a focus on demographic diversity. To support AlignSurvey, we construct a multi-tiered dataset architecture: (i) the Social Foundation Corpus, a cross-national resource with 44K+ interview dialogues and 400K+ structured survey records; and (ii) a suite of Entire-Pipeline Survey Datasets, including the expert-annotated AlignSurvey-Expert (ASE) and two nationally representative surveys for cross-cultural evaluation. We release the SurveyLM family, obtained through two-stage fine-tuning of open-source LLMs, and offer reference models for evaluating domain-specific alignment. All datasets, models, and tools are available on GitHub and Hugging Face to support transparent and socially responsible research.

AlignSurvey: A Comprehensive Benchmark for Human Preferences Alignment in Social Surveys

Image classification systems often inherit biases from uneven group representation in training data. For example, in face datasets for hair color classification, blond hair may be disproportionately associated with females, reinforcing stereotypes. A recent approach leverages the Stable Diffusion model to generate balanced training data, but these models often struggle to preserve the original data distribution. In this work, we explore multiple diffusion-finetuning techniques, e.g., LoRA and DreamBooth, to generate images that more accurately represent each training group by learning directly from their samples. Additionally, in order to prevent a single DreamBooth model from being overwhelmed by excessive intra-group variations, we explore a technique of clustering images within each group and train a DreamBooth model per cluster. These models are then used to generate group-balanced data for pretraining, followed by fine-tuning on real data. Experiments on multiple benchmarks demonstrate that the studied finetuning approaches outperform vanilla Stable Diffusion on average and achieve results comparable to SOTA debiasing techniques like Group-DRO, while surpassing them as the dataset bias severity increases. Code will be made public upon acceptance.

Harnessing Diffusion-Generated Synthetic Images for Fair Image Classification

Personalized insulin therapy for individuals with Type 1 Diabetes via closed‑loop artificial pancreas systems requires rapid adaptation of dosing strategies to each patient's unique insulin response. However, learning patient‑specific policies from scratch demands extensive exploration, which is often impractical. In this work, we study a framework that integrates insulin-response-informed transfer learning with model-based reinforcement learning for insulin dosing. We first train an LSTM‑based insulin responsiveness predictor on virtual patients, using their glucose, insulin, and meal history to forecast future glucose levels. Analysis of insulin responsiveness of in-silico patients uncovers natural insulin‑response groups characterized by similar sensitivity and dynamics profiles. For a new patient, we identify a representative model from their response group and use it to generate synthetic trajectories. These trajectories are integrated into an enhanced H-step Deep Dyna-Q algorithm, enabling accelerated policy optimization through model-based planning. The dynamics model trained entirely in simulation achieves 91.31\% accuracy in predicting blood glucose ranges on the Ohio Type 1 Diabetes dataset, indicating strong zero-shot generalization. Additionally, we find that bootstrapping a new patient with a physiologically-matched reference model accelerates convergence of effective dosing policies across in-silico cohorts of children, adolescents, and adults. These findings suggest that leveraging response-group-specific synthetic experience can expedite personalized insulin therapy, offering a promising pathway towards clinical validation.

Bootstrapping Personalized Insulin Therapy via Model-Based Reinforcement Learning: An In Silico Study

Existing approaches to complaint analysis largely rely on unimodal, short-form content such as tweets or product reviews. This work advances the field by leveraging multimodal, multi-turn customer support dialogues—where users often share both textual complaints and visual evidence (e.g., screenshots, product photos)—to enable fine-grained classification of complaint aspects and severity. We introduce $\textit{VALOR}$, a Validation-Aware Learner with Expert Routing, tailored for this multimodal setting. It employs a multi-expert reasoning setup using large-scale generative models with Chain-of-Thought (CoT) prompting for nuanced decision-making. To ensure coherence between modalities, a semantic alignment score is computed and integrated into the final classification through a meta-fusion strategy. In alignment with the United Nations Sustainable Development Goals (UN SDGs), the proposed framework supports SDG 9 (Industry, Innovation and Infrastructure) by advancing AI-driven tools for robust, scalable, and context-aware service infrastructure. Further, by enabling structured analysis of complaint narratives and visual context, it contributes to SDG 12 (Responsible Consumption and Production) by promoting more responsive product design and improved accountability in consumer services. We evaluate $\textit{VALOR}$ on a curated multimodal complaint dataset annotated with fine-grained aspect and severity labels, showing that it consistently outperforms baseline models, especially in complex complaint scenarios where information is distributed across text and images. This study underscores the value of multimodal interaction and expert validation in practical complaint understanding systems. Resources related to data and codes are available here: https://anonymous.4open.science/r/672.

Talk, Snap, Complain: Validation-Aware Multimodal Expert Framework for Fine-Grained Customer Grievances

Although large language models (LLMs) are increasingly implicated in interpersonal and societal decision-making, their ability to navigate explicit conflicts between legitimately different cultural value systems remains largely unexamined. Existing benchmarks predominantly target cultural knowledge (CulturalBench), value prediction (WorldValuesBench), or single-axis bias diagnostics (CDEval); none, however, evaluate how LLMs adjudicate when multiple culturally grounded values directly clash. We address this gap with CCD-Bench, a benchmark that assesses LLM decision-making under overt cross-cultural value conflict. CCD-Bench comprises 2,182 open-ended dilemmas spanning seven domains, each paired with exactly ten anonymized response options corresponding to the ten GLOBE cultural clusters, which represent the organizational behavior of 62 societies. These dilemmas are presented using a Stratified Latin Square to mitigate ordering effects. We evaluate 17 leading non-reasoning LLMs. LLMs disproportionately prefer Nordic Europe (mean 20.2\%) and Germanic Europe (12.4\%), while the options for Eastern Europe and the Middle East \& North Africa are underrepresented (5.6–5.8\%). Although 87.9\% of rationales reference two or more GLOBE dimensions, this apparent pluralism is largely superficial: LLMs repeatedly recombine a narrow subset of Future Orientation and Performance Orientation, and rarely ground choices in Assertiveness or Gender Egalitarianism (both $<$3\%). Ordering effects are negligible (Cramér’s $V < 0.10$), and symmetrized KL divergence indicates LLMs clustering by developer lineage rather than geography. Taken together, these patterns suggest that contemporary alignment pipelines encourage a consensus-oriented, progress-centric worldview that underserves scenarios demanding explicit power negotiation, rights-based reasoning, or gender-aware analysis.
CCD-Bench thus shifts evaluation from isolated bias detection to pluralistic decision making, revealing that current LLMs maintain Western-centric, consensus-oriented preferences even when confronted with ten equally valid, culturally diverse alternatives, and underscoring the need for alignment strategies that substantively engage with diverse worldviews.
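
The two statistics named in this abstract can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' analysis code: the symmetrized KL divergence is taken here as the sum of both KL directions (other conventions halve it), a small `eps` smoothing term is added to avoid division by zero, and all input numbers are invented toy values.

```python
import math


def symmetrized_kl(p, q, eps=1e-12):
    """KL(p||q) + KL(q||p) for two discrete distributions (assumed convention)."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    # Smooth and renormalize so zero entries do not break the logarithm.
    p = [x + eps for x in p]
    q = [x + eps for x in q]
    sp, sq = sum(p), sum(q)
    p = [x / sp for x in p]
    q = [x / sq for x in q]
    # Sum of both KL directions simplifies to sum((p - q) * log(p / q)).
    return sum((a - b) * math.log(a / b) for a, b in zip(p, q))


def cramers_v(table):
    """Cramér's V for a contingency table given as a list of rows of counts."""
    n = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_sums[i] * col_sums[j] / n
            if expected > 0:
                chi2 += (observed - expected) ** 2 / expected
    k = min(len(row_sums), len(col_sums)) - 1
    return math.sqrt(chi2 / (n * k))


# Toy choice distributions over three options for two hypothetical models.
model_a = [0.5, 0.3, 0.2]
model_b = [0.2, 0.3, 0.5]
print(symmetrized_kl(model_a, model_b))

# Toy table: rows are option positions, columns are chosen / not chosen.
print(cramers_v([[12, 88], [10, 90], [11, 89]]))
```

A value of V near zero for a position-by-choice table would indicate that the position in which an option is shown has little association with whether it is chosen, which is the sense in which the abstract calls ordering effects negligible.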

CCD-Bench: Probing Cultural Conflict in Large Language Model Decision-Making

Classifiers trained on historical data are deployed in the real world to automate decisions from hiring to loan issuance. Judging the fairness and efficiency of these systems, and their human counterparts, is a complex and important topic studied across both computational and social sciences. One common way to address bias in classifiers is to resample the training data to offset distributional disparities. In the hiring domain, where results may vary by a protected class, many interventions from the literature equalize the hiring rate within the training set to alleviate bias in the resulting classifier. While simple and seemingly effective, these methods have typically only been evaluated using data obtained through convenience samples, e.g., results of some real-world hiring process, introducing selection and label bias into the evaluation. In the social and health sciences, audit studies, in which fictitious ``testers'' (resumes) are sent to subjects (job openings) in a randomized controlled trial, provide high-quality data that support rigorous estimates of discrimination by controlling for confounding factors. In this paper, we investigate how data from audit studies can be used to improve our ability to both train and evaluate automated hiring algorithms. We find that audit data of real-world hiring reveals cases where equalizing base rates across classes \emph{appears} to achieve parity using traditional measures, but in fact has $\approx$ 10\% disparity when measured appropriately. We also show that corrections based on individual treatment effect estimation methods combined with audit study data can overcome these issues, underscoring the need for rigorous data collection in fairness research.
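
To make the resampling intervention discussed in this abstract concrete, here is a minimal sketch of equalizing the positive (hired) rate across groups by downsampling positives in the higher-rate groups. It illustrates the kind of baseline intervention the abstract questions, not the authors' proposed correction. The field names `group` and `hired`, the choice to downsample rather than upsample, and the toy data are all assumptions.

```python
import random
from collections import defaultdict


def positive_rate(rows, group, group_key="group", label_key="hired"):
    """Share of rows in `group` whose label is 1."""
    members = [r for r in rows if r[group_key] == group]
    return sum(r[label_key] for r in members) / len(members)


def equalize_positive_rate(rows, group_key="group", label_key="hired", seed=0):
    """Downsample positives so every group matches the lowest group rate."""
    rng = random.Random(seed)
    by_group = defaultdict(lambda: {0: [], 1: []})
    for r in rows:
        by_group[r[group_key]][r[label_key]].append(r)

    # Target is the smallest positive rate observed in any group.
    target = min(
        len(g[1]) / (len(g[0]) + len(g[1])) for g in by_group.values()
    )

    balanced = []
    for g in by_group.values():
        negatives, positives = g[0], g[1]
        if target >= 1.0:
            keep = len(positives)
        else:
            # Solve keep / (keep + len(negatives)) = target for keep.
            keep = round(target * len(negatives) / (1.0 - target))
        keep = min(keep, len(positives))
        balanced.extend(negatives)
        balanced.extend(rng.sample(positives, keep))
    return balanced


# Toy data: group A is hired at 60%, group B at 20%.
rows = (
    [{"group": "A", "hired": 1}] * 6 + [{"group": "A", "hired": 0}] * 4
    + [{"group": "B", "hired": 1}] * 2 + [{"group": "B", "hired": 0}] * 8
)
balanced = equalize_positive_rate(rows)
print(positive_rate(balanced, "A"), positive_rate(balanced, "B"))
```

Equal rates in the resampled training set say nothing about whether the original labels were themselves biased, which is the gap the abstract argues audit-study data can expose.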

The Illusion of Fairness: Auditing Fairness Interventions in Algorithmic Hiring with Audit Studies

We introduce PandemIQ Llama, a domain-adapted large language model (LLM) designed specifically for pandemic intelligence applications. Building on the pre-trained Llama-3.1-8B model, we conducted continuous training using our curated Pandemic Corpus. This dataset was assembled from authoritative public health sources, scientific literature, and specialized knowledge repositories, comprising 508,924 documents totaling 5.8 billion tokens, which is the largest pandemic-domain-specific data cohort for LLM training.
Benefiting from this large curated corpus and from continuous training with extensive computational resources, PandemIQ Llama can extract critical domain knowledge on pandemics, which is typically underrepresented in general-purpose language models. To evaluate its performance, we conducted a comprehensive comparison of PandemIQ Llama with both prompt-engineered and task-specific fine-tuned baseline models using two tasks: the Biomedical Alert News Question Answering task (1,508 reports with 30 expert-generated questions each) and the Disease Event Type Classification benchmark (4,500 news snippets across eight disease categories). PandemIQ Llama demonstrated substantial improvements over strong baseline models, achieving performance gains ranging from 3.8% to 10.97%. These results suggest that PandemIQ Llama could significantly enhance public health surveillance and analysis capabilities. In addition, our results suggest that LLMs can perform better with continuous training than with fine-tuning on domain-specific tasks. Social Impact: This model will be integrated with Epidemic Intelligence from Open Sources (EIOS), run by the World Health Organization (WHO). This integration will empower a large community of decision makers and stakeholders in all WHO member countries with the first LLM-based AI tool for pandemic surveillance.

PandemIQ Llama: A Domain-Adapted Foundation Model for Enhanced Pandemic Intelligence

Investigating the effects of climate change and global warming caused by GHG emissions has been a central focus worldwide. These emissions stem largely from the production, use, and disposal of consumer products. Thus, it is important to build tools to estimate the environmental impact of consumer goods, an essential part of which is conducting Life Cycle Assessments (LCAs). LCAs specify and account for the appropriate processes involved with the production, use, and disposal of the products. We present SpiderGen, an LLM-based workflow which integrates the taxonomy and methodology of traditional LCA with the reasoning capabilities and world knowledge of LLMs to generate this procedural information used for LCA. We additionally apply evaluation methods for this use case, and evaluate the output of SpiderGen with real-world LCA documents. We find that SpiderGen provides accurate LCA process information that is either fully correct or has minor errors on average 60\% of the time. We observe that the remaining missed processes and hallucinated errors occur primarily due to differences in detail between LCA documents, as well as differences in the understanding of ``scope'' of which auxiliary processes must also be included. We also demonstrate that SpiderGen performs better than several baseline techniques, such as chain-of-thought prompting and one-shot prompting. Finally, we highlight that SpiderGen has the potential to drastically reduce the human effort and costs for estimating carbon impact, as it is able to produce LCA process information for less than \$1 USD in under 10 minutes, compared to the status quo LCA, which costs over \$25000 USD and takes up to 21 person-days.

SpiderGen: Towards Procedure Generation for Carbon Life Cycle Assessments with Generative AI

Equitable formative feedback remains out of reach for large or low-resource courses because instructors cannot read every learner reflection. We present a theory-grounded pipeline of five role-based LLM agents—Evaluator, Equity Monitor, Metacognitive Coach, Aggregator, and Reflexion Reviewer—that jointly produce calibrated rubric scores and $\le 120$-word, bias-aware comments.
On $84$ reflections from a $12$-session AI-literacy program, the pipeline matches expert raters ($\mathrm{MAE}=0.47$, $\mathrm{QWK}=0.46$, human-AI $\mathrm{ICC}=0.41$) while bounding the worst-ability error gap to $\Delta{\mathrm{MAE}}=0.50$.
Automated scoring finishes in $7.7\text{s}$ per reflection—an $11\times$ speed-up over the human mean of $1.4\text{min}$—and complete feedback in $33\text{s}$.
A full agentic run costs just $0.0016$ per reflection.
Three trained graders rated the AI feedback highly useful (overall $Q(g)=3.97/5$) with top marks for empathy ($4.22/5$).
Contributions. (i) A self-consistent scoring scheme with equity safeguards; (ii) a role-based agent ensemble for dialogic, bias-aware feedback; and (iii) the first open dataset, prompts, and codebase for equitable reflection assessment. Together, these advances demonstrate a practical path toward large-scale, fair feedback in real classrooms.
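
For readers unfamiliar with the agreement metrics quoted in this abstract, the sketch below computes mean absolute error (MAE) and quadratic weighted kappa (QWK) between two raters, and checks the reported speed-up arithmetic. It assumes integer rubric scores on a fixed scale, which the abstract does not state, and the example scores are invented.

```python
def mean_absolute_error(a, b):
    """Average absolute difference between paired scores."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty score lists of equal length")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def quadratic_weighted_kappa(a, b, min_rating, max_rating):
    """QWK for integer ratings in [min_rating, max_rating]."""
    k = max_rating - min_rating + 1
    n = len(a)
    # Observed confusion matrix between the two raters.
    observed = [[0] * k for _ in range(k)]
    for x, y in zip(a, b):
        observed[x - min_rating][y - min_rating] += 1
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(col) for col in zip(*observed)]

    disagreement = 0.0
    chance = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2
            disagreement += weight * observed[i][j]
            # Expected count if the two raters were independent.
            chance += weight * hist_a[i] * hist_b[j] / n
    if chance == 0:
        raise ValueError("kappa is undefined when all ratings are identical")
    return 1.0 - disagreement / chance


# Invented scores on an assumed 1-to-5 rubric.
human = [3, 4, 2, 5, 4, 3]
model = [3, 3, 2, 4, 4, 4]
print(mean_absolute_error(human, model))
print(quadratic_weighted_kappa(human, model, 1, 5))

# Speed-up check: 1.4 min is 84 s, and 84 / 7.7 is roughly 10.9.
speed_up = (1.4 * 60) / 7.7
print(round(speed_up, 1))
```

The speed-up ratio of about 10.9 is consistent with the $11\times$ figure reported for scoring.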
