Singapore

Although large language models (LLMs) are increasingly implicated in interpersonal and societal decision-making, their ability to navigate explicit conflicts between legitimately different cultural value systems remains largely unexamined. Existing benchmarks predominantly target cultural knowledge (CulturalBench), value prediction (WorldValuesBench), or single-axis bias diagnostics (CDEval); none, however, evaluate how LLMs adjudicate when multiple culturally grounded values directly clash. We address this gap with CCD-Bench, a benchmark that assesses LLM decision-making under overt cross-cultural value conflict. CCD-Bench comprises 2,182 open-ended dilemmas spanning seven domains, each paired with exactly ten anonymized response options corresponding to the ten GLOBE cultural clusters, which represent the organizational behavior of 62 societies. These dilemmas are presented using a Stratified Latin Square to mitigate ordering effects. We evaluate 17 leading non-reasoning LLMs. LLMs disproportionately prefer Nordic Europe (mean 20.2\%) and Germanic Europe (12.4\%), while the options for Eastern Europe and the Middle East \& North Africa are underrepresented (5.6–5.8\%). Although 87.9\% of rationales reference two or more GLOBE dimensions, this apparent pluralism is largely superficial: LLMs repeatedly recombine a narrow subset of Future Orientation and Performance Orientation, and rarely ground choices in Assertiveness or Gender Egalitarianism (both $<$3\%). Ordering effects are negligible (Cramér’s $V < 0.10$), and symmetrized KL divergence indicates LLMs clustering by developer lineage rather than geography. Taken together, these patterns suggest that contemporary alignment pipelines encourage a consensus-oriented, progress-centric worldview that underserves scenarios demanding explicit power negotiation, rights-based reasoning, or gender-aware analysis.
CCD-Bench thus shifts evaluation from isolated bias detection to pluralistic decision-making, revealing that current LLMs maintain Western-centric, consensus-oriented preferences even when confronted with ten equally valid, culturally diverse alternatives, and underscoring the need for alignment strategies that substantively engage with diverse worldviews.
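The two statistics named in the abstract can be illustrated with a minimal sketch. The abstract does not say which form of symmetrized KL divergence is used, so the sum KL(p‖q) + KL(q‖p) below is an assumption, as are the function names, the `eps` smoothing, and the requirement that every row and column of the contingency table has a nonzero total.

```python
import math

def symmetrized_kl(p, q, eps=1e-12):
    """Symmetrized KL divergence KL(p||q) + KL(q||p), in nats.

    p and q are selection distributions over the same options, e.g. the
    share of choices falling on each of the ten GLOBE clusters.
    """
    # Assumption of this sketch: clip zeros to eps so log() is defined,
    # then renormalize so both inputs are proper distributions.
    p = [max(x, eps) for x in p]
    q = [max(x, eps) for x in q]
    sp, sq = sum(p), sum(q)
    p = [x / sp for x in p]
    q = [x / sq for x in q]
    kl_pq = sum(a * math.log(a / b) for a, b in zip(p, q))
    kl_qp = sum(b * math.log(b / a) for a, b in zip(p, q))
    return kl_pq + kl_qp

def cramers_v(table):
    """Cramér's V for an r x c contingency table of counts.

    For an ordering-effect check, rows could be presentation positions and
    columns the chosen cluster (a hypothetical layout, not the paper's).
    """
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(table), len(table[0])) - 1
    return math.sqrt(chi2 / (n * k))
```

A value of `cramers_v` near 0 means the chosen option is close to independent of presentation position, which is how the abstract's "negligible" threshold of $V < 0.10$ reads. Pairwise `symmetrized_kl` values between models' selection distributions could then feed a clustering step.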

AAAI 2026

CCD-Bench: Probing Cultural Conflict in Large Language Model Decision-Making

nlp: (large) language models; hai: learning human values and preferences; peai: bias, fairness & equity

poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held at Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.


Classifiers trained on historical data are deployed in the real world to automate decisions from hiring to loan issuance. Judging the fairness and efficiency of these systems, and their human counterparts, is a complex and important topic studied across both computational and social sciences. One common way to address bias in classifiers is to resample the training data to offset distributional disparities. In the hiring domain, where results may vary by a protected class, many interventions from the literature equalize the hiring rate within the training set to alleviate bias in the resulting classifier. While simple and seemingly effective, these methods have typically only been evaluated using data obtained through convenience samples, e.g., results of some real-world hiring process, introducing selection and label bias into the evaluation. In the social and health sciences, audit studies, in which fictitious ``testers'' (resumes) are sent to subjects (job openings) in a randomized controlled trial, provide high-quality data that support rigorous estimates of discrimination by controlling for confounding factors. In this paper, we investigate how data from audit studies can be used to improve our ability to both train and evaluate automated hiring algorithms. We find that audit data of real-world hiring reveals cases where equalizing base rates across classes \emph{appears} to achieve parity using traditional measures, but in fact has $\approx$10\% disparity when measured appropriately. We also show that corrections based on individual treatment effect estimation methods combined with audit study data can overcome these issues, underscoring the need for rigorous data collection in fairness research.

The Illusion of Fairness: Auditing Fairness Interventions in Algorithmic Hiring with Audit Studies
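The resampling intervention that the abstract above describes (and then critiques) can be sketched roughly as follows. This is a hypothetical illustration, not the authors' code: the function names, the `(group, hired)` row layout, and the choice to downsample rather than upsample are all assumptions.

```python
import random
from collections import defaultdict

def equalize_hire_rate(rows, target_rate, seed=0):
    """Downsample within each group so all groups share one hiring rate.

    rows: list of (group, hired) tuples with hired in {0, 1}.
    Assumes every group contains both hired and non-hired examples.
    """
    if not 0.0 < target_rate < 1.0:
        raise ValueError("target_rate must be strictly between 0 and 1")
    rng = random.Random(seed)
    by_group = defaultdict(lambda: {0: [], 1: []})
    for group, hired in rows:
        by_group[group][hired].append((group, hired))

    balanced = []
    for split in by_group.values():
        pos, neg = split[1], split[0]
        rate = len(pos) / (len(pos) + len(neg))
        if rate > target_rate:
            # Too many positives: keep only enough to hit the target rate.
            keep = round(target_rate * len(neg) / (1.0 - target_rate))
            pos = rng.sample(pos, min(keep, len(pos)))
        else:
            # Too many negatives: keep only enough to hit the target rate.
            keep = round(len(pos) * (1.0 - target_rate) / target_rate)
            neg = rng.sample(neg, min(keep, len(neg)))
        balanced.extend(pos + neg)
    rng.shuffle(balanced)
    return balanced

def hire_rate(rows, group):
    """Fraction of examples in `group` labeled as hired."""
    labels = [hired for g, hired in rows if g == group]
    return sum(labels) / len(labels)
```

Equal base rates in the resampled training set are the condition that, according to the abstract, can appear to achieve parity under traditional measures while a disparity remains when measured against audit data. The sketch shows only the intervention, not the audit-based evaluation.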

We introduce PandemIQ Llama, a domain-adapted large language model (LLM) designed specifically for pandemic intelligence applications. Building on the pre-trained Llama-3.1-8B model, we conducted continuous training using our curated Pandemic Corpus. This dataset was assembled from authoritative public health sources, scientific literature, and specialized knowledge repositories, comprising 508,924 documents totaling 5.8 billion tokens, which is the largest pandemic-domain-specific data cohort for LLM training.
Benefiting from our curated large data cohorts and from continuous training that leverages extensive computational resources, PandemIQ Llama can extract critical domain knowledge on pandemics, which is typically underrepresented in general-purpose language models. To evaluate its performance, we conducted a comprehensive comparison of PandemIQ Llama with both prompt-engineered and task-specific fine-tuned baseline models using two tasks: the Biomedical Alert News Question Answering task (1,508 reports with 30 expert-generated questions each) and the Disease Event Type Classification benchmark (4,500 news snippets across eight disease categories). PandemIQ Llama demonstrated substantial improvements over strong baseline models, achieving performance gains ranging from 3.8% to 10.97%. These results suggest that PandemIQ Llama could significantly enhance public health surveillance and analysis capabilities. In addition, our results suggest that LLMs can perform better with continuous training than with fine-tuning on domain-specific tasks. Social Impact: This model will be integrated with Epidemic Intelligence from Open Sources (EIOS), run by the World Health Organization (WHO). This integration will empower a large community of decision makers and stakeholders in all WHO member countries with the first LLM-based AI tool for pandemic surveillance.

PandemIQ Llama: A Domain-Adapted Foundation Model for Enhanced Pandemic Intelligence

Investigating the effects of climate change and global warming caused by GHG emissions has been a central focus worldwide. These emissions are largely contributed to by the production, use, and disposal of consumer products. Thus, it is important to build tools to estimate the environmental impact of consumer goods, an essential part of which is conducting Life Cycle Assessments (LCAs). LCAs specify and account for the appropriate processes involved with the production, use, and disposal of the products. We present SpiderGen, an LLM-based workflow which integrates the taxonomy and methodology of traditional LCA with the reasoning capabilities and world knowledge of LLMs to generate this procedural information used for LCA. We additionally apply evaluation methods for this use-case, and evaluate the output of SpiderGen with real-world LCA documents. We find that SpiderGen provides accurate LCA process information that is either fully correct or has minor errors on average 60\% of the time. We observe that the remaining missed processes and hallucinated errors occur primarily due to differences in detail between LCA documents, as well as differences in the understanding of the ``scope'' of which auxiliary processes must also be included. We also demonstrate that SpiderGen performs better than several baseline techniques, such as chain-of-thought prompting and one-shot prompting. Finally, we highlight that SpiderGen has the potential to drastically reduce the human effort and costs for estimating carbon impact, as it is able to produce LCA process information for less than \$1 USD in under 10 minutes, as compared to the status quo LCA, which costs over \$25,000 USD and takes up to 21 person-days.

SpiderGen: Towards Procedure Generation for Carbon Life Cycle Assessments with Generative AI

Equitable formative feedback remains out of reach for large or low-resource courses because instructors cannot read every learner reflection. We present a theory-grounded pipeline of five role-based LLM agents—Evaluator, Equity Monitor, Metacognitive Coach, Aggregator, and Reflexion Reviewer—that jointly produce calibrated rubric scores and $\le 120$-word, bias-aware comments.
On $84$ reflections from a $12$-session AI-literacy program, the pipeline matches expert raters ($\mathrm{MAE}=0.47$, $\mathrm{QWK}=0.46$, human-AI $\mathrm{ICC}=0.41$) while bounding the worst-ability error gap to $\Delta{\mathrm{MAE}}=0.50$.
Automated scoring finishes in $7.7\text{s}$ per reflection—an $11\times$ speed-up over the human mean of $1.4\text{min}$—and complete feedback in $33\text{s}$.
A full agentic run costs just $0.0016$ per reflection.
Three trained graders rated the AI feedback highly useful (overall $Q(g)=3.97/5$) with top marks for empathy ($4.22/5$).
Contributions. (i) A self-consistent scoring scheme with equity safeguards; (ii) a role-based agent ensemble for dialogic, bias-aware feedback; and (iii) the first open dataset, prompts, and codebase for equitable reflection assessment. Together, these advances demonstrate a practical path toward large-scale, fair feedback in real classrooms.

Scaling Equitable Reflection Assessment in Education via Large Language Models and Role-Based Feedback Agents
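Two of the agreement metrics reported in the abstract above, MAE and quadratic weighted kappa (QWK), can be computed as in this minimal sketch. It assumes integer rubric scores on a known scale, which the abstract does not state, and the function names are illustrative.

```python
def mae(a, b):
    """Mean absolute error between two equal-length lists of scores."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def quadratic_weighted_kappa(a, b, min_rating, max_rating):
    """Quadratic weighted kappa between two raters' integer scores.

    Assumes max_rating > min_rating and that the raters do not both give
    one constant score (otherwise the denominator is zero).
    """
    k = max_rating - min_rating + 1
    n = len(a)
    # observed[i][j]: number of items rater A scored i and rater B scored j.
    observed = [[0] * k for _ in range(k)]
    for x, y in zip(a, b):
        observed[x - min_rating][y - min_rating] += 1
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(col) for col in zip(*observed)]

    num = 0.0  # weighted observed disagreement
    den = 0.0  # weighted disagreement expected by chance
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2
            expected = hist_a[i] * hist_b[j] / n
            num += weight * observed[i][j]
            den += weight * expected
    return 1.0 - num / den
```

QWK is 1 for perfect agreement, about 0 for chance-level agreement, and negative when raters disagree more than chance would predict. Because the weights grow with the squared distance between scores, large disagreements are penalized more than near misses.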

Social media memes are a challenging domain for hate detection because they intertwine visual and textual cues into culturally nuanced messages. To tackle these challenges, we introduce TRACE, a hierarchical multimodal framework that leverages visually grounded context augmentation, along with a novel caption-scoring network to emphasize hate-relevant content, and parameter-efficient fine-tuning of CLIP’s text encoder. Our experiments demonstrate that selectively fine-tuning deeper text encoder layers significantly enhances performance compared to simpler projection-layer fine-tuning methods. Specifically, our framework achieves state-of-the-art accuracy (0.807) and F1-score (0.806) on the widely-used Hateful Memes dataset, matching the performance of considerably larger models while maintaining efficiency. Moreover, it achieves superior generalization on the MultiOFF offensive meme dataset (F1-score 0.673), highlighting robustness across meme categories. Additional analyses confirm that robust visual grounding and nuanced text representations significantly reduce errors caused by benign confounders. We will publicly release our code and models to facilitate future research.

TRACE: Textual Relevance Augmentation and Contextual Encoding for Multimodal Hate Detection

This paper presents the largest known benchmark dataset for road damage assessment and road alignment, and provides 18 baseline models trained on the CRASAR-U-DRIODs dataset’s post-disaster small uncrewed aerial systems (sUAS) imagery from 10 federally declared disasters, addressing three challenges within prior post-disaster road damage assessment datasets. While prior disaster road damage assessment datasets exist, there is no current state of practice, as prior public datasets have either been small-scale or reliant on low-resolution imagery insufficient for detecting phenomena of interest to emergency managers. Further, while machine learning (ML) systems have been developed for this task previously, none are known to have been operationally validated. These limitations are overcome in this work through the labeling of 657.25 km of roads according to a 10-class labeling schema, followed by training and deploying ML models during the operational response to Hurricanes Debby and Helene in 2024. Motivated by observed road line misalignment in practice, 9,184 road line adjustments were provided for spatial alignment of a priori road lines, as it was found that when the 18 baseline models are deployed against real-world misaligned road lines, model performance degraded on average by 5.596% Macro IoU. If spatial alignment is not considered, approximately 8% (11 km) of adverse conditions on road lines will be labeled incorrectly, with approximately 9% (59 km) of road lines misaligned off the actual road. These dynamics are gaps that should be addressed by the ML, CV, and robotics communities to enable more effective and informed decision-making during disasters.

A Benchmark Dataset for Spatially Aligned Road Damage Assessment in Small Uncrewed Aerial Systems Disaster Imagery

Integrating Large Language Models (LLMs) into judicial decision-making demands rigorous safety examination against non-legal influences. This paper presents a novel stress test where we evaluate LLM-generated labor dispute outcomes by introducing social media sentiment as an external pressure, critically comparing them against 10,000 real-world court judgments from China Judgments Online (CJOL). Our findings reveal significant LLM safety vulnerabilities: models exhibit inherent deviations from real rulings, and public opinion substantially amplifies these discrepancies, leading to unstable and often inflated compensation predictions. Furthermore, these safety risks are compounded across low-skilled occupational categories and emotionally charged topics. This study uncovers critical threats to judicial integrity and public trust, underscoring the urgent need for robust safeguards against non-legal influences in AI legal systems.

LLM Safety in Judicial AI: A Stress Test of Social Media Influence on Real-World Judgments

Sustainability is becoming increasingly critical in maritime transport, encompassing both environmental and social impacts, such as Greenhouse Gas (GHG) emissions and navigational safety. Traditional vessel navigation relies heavily on human experience, often lacking autonomy and emission awareness, and is prone to human errors that may compromise safety. In this paper, we propose a Curriculum Reinforcement Learning (CRL) framework integrated with a realistic, data-driven marine simulation environment and a machine learning-based fuel consumption prediction module. The simulation environment is constructed using real-world vessel movement data and enhanced with a Diffusion Model to simulate dynamic maritime conditions. Vessel fuel consumption is estimated using historical operational data and learning-based regression. The surrounding environment is represented as image-based inputs to capture spatial complexity. We design a lightweight, policy-based CRL agent with a comprehensive reward mechanism that considers safety, emissions, timeliness, and goal completion. This framework effectively handles complex tasks progressively while ensuring stable and efficient learning in continuous action spaces. We validate the proposed approach in a sea area of the Indian Ocean, demonstrating its efficacy in enabling sustainable and safe vessel navigation.

Realistic Curriculum Reinforcement Learning for Autonomous and Sustainable Marine Vessel Navigation

The opioid crisis represents a significant moment in public health that reveals systemic shortcomings across regulatory systems, healthcare practices, corporate governance, and public policy. Analyzing how these interconnected systems simultaneously failed to protect public health requires innovative analytic approaches for exploring the vast amounts of data and documents disclosed in the UCSF-JHU Opioid Industry Documents Archive (OIDA).
The complexity, multimodal nature, and specialized characteristics of these healthcare-related legal and corporate documents necessitate more advanced methods and models tailored to specific data types and detailed annotations, ensuring precision and professionalism in the analysis.
In this paper, we tackle this challenge by organizing the original dataset according to document attributes and constructing a benchmark with 400k training documents and 10k for testing. 
From each document, we extract rich multimodal information—including textual content, visual elements, and layout structures—to capture a comprehensive range of features. 
Using multiple AI models, we then generate a large-scale dataset comprising 360k training QA pairs and 10k testing QA pairs.
Building on this foundation, we develop domain-specific multimodal Large Language Models (LLMs) and explore the impact of multimodal inputs on task performance. To further enhance response accuracy, we incorporate historical QA pairs as contextual grounding for answering current queries.
Additionally, we incorporate page references within the answers and introduce an importance-based page classifier, further improving the precision and relevance of the information provided.
Preliminary results indicate that our AI assistant improves document information extraction and question answering, highlighting the effectiveness of our benchmark in addressing the opioid crisis.
Data and models will be released for public research.

OIDA-QA: A Multimodal Benchmark for Analyzing the Opioid Industry Documents Archive

The diversity across populations and the variability between individuals have long posed a significant challenge in cognitive science. Although large language models (LLMs) have made notable progress in aligning with human values, faithfully capturing the high degree of diversity and uncertainty in human judgment remains an unresolved challenge. This study investigates whether computational models, or ``proxy agents,'' can not only emulate human decision patterns but also systematically modulate them. We propose a framework wherein we first fine-tune BERT-based proxy agents to replicate both aggregate and individual-level human judgments on a large-scale moral dilemma dataset. We then hypothesize that stimuli identified as maximally divisive for these individualized agents will similarly elicit high disagreement among human participants. Through a human-in-the-loop experiment, we validate this hypothesis, demonstrating that agent-selected stimuli can predictably induce targeted divergence in human moral choices. Our findings provide empirical evidence that AI agents can bias human perceptual variability by strategically filtering information. We further analyze this induced moral divergence using a Bayesian framework and concept decomposition to identify the distinct conceptual dimensions driving individual differences. This work quantifies the potential for AI-driven cognitive modulation and underscores the urgent need for ethical guidelines to prevent the misuse of such capabilities.
