Studies of LLMs' political opinions mainly evaluate their open-ended responses. However, recent work indicates misalignment between LLMs' stated responses and their internal intentions, which motivates us to probe LLMs' internal mechanisms and uncover their internal political states. Additionally, analyses of LLMs' political opinions often rely on single-axis concepts, which can lead to concept confounds.
Our work extends this analysis to multiple dimensions and applies interpretability techniques for more transparent political concept learning in LLMs. Specifically, we design a four-dimensional political learning framework and construct a corresponding dataset for fine-grained political concept vector learning. These vectors can be used both to detect and to intervene on LLM internals.
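As a concrete illustration (not the paper's exact method), a common recipe for learning such a concept vector is the difference of class means over hidden activations elicited by contrastive prompts. The sketch below assumes a HuggingFace causal LM; the model id, layer index, and example statements are placeholders, not taken from the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-hf"  # placeholder model id, not from the paper
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True)
model.eval()

LAYER = 15  # hypothetical mid-network layer

@torch.no_grad()
def mean_activation(prompts, layer=LAYER):
    """Average the last-token hidden state after `layer` over a prompt set."""
    acts = []
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        out = model(**ids)
        # hidden_states[0] is the embedding output, so layer i's output
        # sits at index i + 1; take the final token of the sequence.
        acts.append(out.hidden_states[layer + 1][0, -1])
    return torch.stack(acts).mean(dim=0)

# Contrastive statements for one hypothetical axis; the paper's dataset
# would contain many such statements per political dimension.
pos_prompts = ["The state should guarantee universal healthcare."]
neg_prompts = ["Healthcare should be left to private markets."]

# Concept vector = difference of the two class means, unit-normalized
# so it can double as a linear probe direction for detection.
concept_vec = mean_activation(pos_prompts) - mean_activation(neg_prompts)
concept_vec = concept_vec / concept_vec.norm()
```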
Experiments on eight open-source LLMs with three representation engineering techniques show that these vectors can disentangle political concept confounds. Detection tasks validate the vectors' semantic meaning and demonstrate good generalization and robustness in out-of-distribution settings. Intervention experiments show that these vectors can implicitly steer LLMs toward generating responses with targeted political leanings. These findings highlight the need for more transparent model auditing in future AI governance.
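For the intervention side, one typical representation-engineering move is to add the scaled concept vector to the residual stream during generation via a forward hook. Below is a minimal sketch under that assumption, reusing `model`, `tok`, `concept_vec`, and `LAYER` from the previous snippet; the steering strength and the Llama-style module path `model.model.layers` are illustrative, not the paper's setup.

```python
ALPHA = 4.0  # illustrative steering strength

def steer(module, inputs, output):
    # Llama-style decoder layers return a tuple whose first element is the
    # hidden states; shift them along the concept direction.
    hidden = output[0] + ALPHA * concept_vec.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.model.layers[LAYER].register_forward_hook(steer)
try:
    ids = tok("What is your view on taxation?", return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=60, do_sample=False)
    print(tok.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # detach the hook so later calls run unmodified
```

Flipping the sign of `ALPHA` steers toward the opposite pole of the same axis, which is what makes a single learned direction usable for controlled intervention in both directions.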