Singapore

Recursive prompting with large language models enables scalable synthetic dataset generation but introduces the risk of bias amplification. We investigate gender bias dynamics across three generations of recursive text generation using three complementary evaluation frameworks: rule-based pattern matching, embedding-based semantic similarity, and downstream task performance. Experiments with three initial bias levels (0.1, 0.3, 0.6) and four mitigation strategies reveal equilibrium dynamics rather than monotonic amplification. Low initial bias amplifies toward the model’s inherent bias level (+36%), whereas high initial bias decays toward it (−26%). Among mitigation methods, contrastive augmentation, which introduces gender-swapped variants, achieves significant downstream bias reduction (98.8% for low initial bias and 91% on average) despite producing higher embedding-based bias scores. This paradox demonstrates that semantic similarity metrics may diverge from behavioral fairness outcomes, highlighting the need for multidimensional evaluation in responsible synthetic data generation.

AAAI 2026

Equilibrium Dynamics and Mitigation of Gender Bias in Synthetically Generated Data

workshop paper
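The contrastive augmentation mentioned in the abstract above (adding gender-swapped variants of each text) can be illustrated with a minimal sketch. The swap lexicon, the function names `gender_swap` and `augment`, and the casing rules are all assumptions made for illustration; the abstract does not describe the authors' actual implementation.

```python
import re

# Hypothetical, deliberately tiny swap lexicon (an assumption, not from the paper).
# A realistic lexicon would be much larger and would have to handle ambiguous
# forms such as "her" (him/his), which are left out here on purpose.
SWAP_PAIRS = [
    ("he", "she"),
    ("man", "woman"),
    ("men", "women"),
    ("father", "mother"),
    ("son", "daughter"),
    ("brother", "sister"),
]

# Build a symmetric lookup so each word maps to its counterpart.
SWAP = {}
for a, b in SWAP_PAIRS:
    SWAP[a] = b
    SWAP[b] = a

# Match whole words only, ignoring case.
_WORD = re.compile(r"\b(" + "|".join(map(re.escape, SWAP)) + r")\b", re.IGNORECASE)


def gender_swap(text):
    """Return text with every lexicon word replaced by its counterpart."""
    def repl(match):
        word = match.group(0)
        swapped = SWAP[word.lower()]
        if word.isupper():
            return swapped.upper()
        if word[0].isupper():
            return swapped.capitalize()
        return swapped
    return _WORD.sub(repl, text)


def augment(corpus):
    """Keep each sentence and add its swapped variant when the swap changes it."""
    out = []
    for sentence in corpus:
        out.append(sentence)
        swapped = gender_swap(sentence)
        if swapped != sentence:
            out.append(swapped)
    return out
```

For example, `augment(["He left.", "It rained."])` keeps both originals and adds one swapped variant for the first sentence only, since the second contains no lexicon word.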

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held at Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.

Knowledge Graphs (KGs) enable applications in various domains such as semantic search, recommendation systems, and natural language processing. KGs are often incomplete, missing entities and relations, an issue addressed by Knowledge Graph Completion (KGC) methods that predict missing elements. Different evaluation metrics, such as Mean Reciprocal Rank (MRR), Mean Rank (MR), and Hit@k (e.g., Hit@1), are commonly used to assess the performance of such KGC models. A major challenge in evaluating KGC models, however, lies in comparing their performance across multiple datasets and metrics. A model may outperform others on one dataset but underperform on another, making it difficult to determine overall superiority. Moreover, even within a single dataset, different metrics such as MRR and Hit@1 can yield conflicting rankings, where one model excels in MRR while another performs better in Hit@1, further complicating model selection for downstream tasks. These inconsistencies hinder holistic comparisons and highlight the need for a unified meta-metric that integrates performance across all metrics and datasets to enable a more reliable and interpretable evaluation framework. To address this need, we propose KG Evaluation based on Distance from Average Solution (EDAS), a robust and interpretable meta-metric that synthesizes model performance across multiple datasets and diverse evaluation criteria into a single normalized score ($M_i \in [0,1]$). Unlike traditional metrics that focus on isolated aspects of performance, EDAS offers a global perspective that supports more informed model selection and promotes fairness in cross-dataset evaluation. Experimental results on benchmark datasets such as FB15k-237 and WN18RR demonstrate that EDAS effectively integrates multi-metric, multi-dataset performance into a unified ranking, offering a consistent, robust, and generalizable framework for evaluating KGC models.

KG-EDAS: A Meta-Metric Framework for Evaluating Knowledge Graph Completion Models
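To make the idea of a distance-from-average meta-metric concrete, here is a minimal sketch of the standard EDAS procedure from the multi-criteria decision-making literature, applied to a models-by-criteria score table. It is an assumption that KG-EDAS follows these steps; the abstract does not give the formulation, and the function name `edas_scores`, the equal-weight example, and the handling of identical models are illustrative choices.

```python
def edas_scores(matrix, weights, beneficial):
    """Standard EDAS appraisal scores, one per model, each in [0, 1].

    matrix[i][j]  : score of model i on criterion j (e.g. MRR on one dataset)
    weights[j]    : criterion weights, assumed to sum to 1
    beneficial[j] : True if higher is better (MRR, Hit@k),
                    False if lower is better (Mean Rank)
    Assumes every criterion has a positive average.
    """
    n_models = len(matrix)
    n_crit = len(weights)

    # Average solution: the mean value of each criterion over all models.
    avg = [sum(row[j] for row in matrix) / n_models for j in range(n_crit)]

    sp, sn = [], []
    for row in matrix:
        pos = 0.0  # weighted sum of positive distances from average
        neg = 0.0  # weighted sum of negative distances from average
        for j in range(n_crit):
            diff = (row[j] - avg[j]) if beneficial[j] else (avg[j] - row[j])
            pos += weights[j] * max(0.0, diff) / avg[j]
            neg += weights[j] * max(0.0, -diff) / avg[j]
        sp.append(pos)
        sn.append(neg)

    # Normalise; fall back to 1.0 when all models are identical (illustrative guard).
    max_sp = max(sp) or 1.0
    max_sn = max(sn) or 1.0
    nsp = [v / max_sp for v in sp]
    nsn = [1.0 - v / max_sn for v in sn]

    # Appraisal score: mean of the two normalised components.
    return [(p + q) / 2.0 for p, q in zip(nsp, nsn)]
```

With three hypothetical models scored on MRR (higher is better) and Mean Rank (lower is better), a model that beats the average on both criteria receives the top score and one that trails on both receives the lowest, so the output can be sorted into a single ranking.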

Dialogue State Tracking (DST) requires precise extraction of structured information from multi-domain conversations, a task where Large Language Models (LLMs) struggle despite their impressive general capabilities. We present GEM (Graph-Enhanced Mixture-of-Experts), a novel framework that combines language models and graph-structured dialogue understanding with ReAct agent-based reasoning for superior DST performance. Our approach dynamically routes between specialized experts: a Graph Neural Network that captures dialogue structure and turn-level dependencies, and a finetuned T5-Small encoder-decoder for sequence modeling, coordinated by an intelligent gating network. For complex value generation tasks, we integrate ReAct agents that perform structured reasoning over dialogue context. On MultiWOZ 2.2, GEM achieves 65.19% Joint Goal Accuracy, substantially outperforming end-to-end LLM approaches (best: 38.43%) and surpassing state-of-the-art (SOTA) methods including TOATOD (63.79%), D3ST (58.70%), and Diable (56.48%). Our graph-enhanced mixture-of-experts architecture with ReAct integration demonstrates that combining structured dialogue representation with dynamic expert routing and agent-based reasoning provides a powerful paradigm for dialogue state tracking, achieving superior accuracy while maintaining computational efficiency through selective expert activation.

GEM: Graph-Enhanced Mixture-of-Experts with ReAct Agents for Dialogue State Tracking
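The headline number in the abstract above is Joint Goal Accuracy, which is usually defined as the fraction of dialogue turns whose full predicted state matches the reference exactly. The sketch below assumes each state is a plain dictionary from slot name to value; the function name and the slot names in the example are illustrative, and benchmark-specific value normalisation is not modelled.

```python
def joint_goal_accuracy(predicted, gold):
    """Fraction of turns whose predicted state equals the gold state exactly.

    predicted, gold: lists of dicts mapping slot name -> value, one per turn.
    A turn counts as correct only if every slot and value matches.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same number of turns")
    if not gold:
        return 0.0
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / len(gold)


# Hypothetical two-turn dialogue: the second turn has one wrong slot value,
# so the whole turn is counted as incorrect.
gold_states = [
    {"hotel-area": "north"},
    {"hotel-area": "north", "hotel-stars": "4"},
]
predicted_states = [
    {"hotel-area": "north"},
    {"hotel-area": "north", "hotel-stars": "3"},
]
jga = joint_goal_accuracy(predicted_states, gold_states)
```

Because a single wrong slot invalidates the entire turn, the metric is strict, which is why scores on multi-domain benchmarks are typically well below per-slot accuracy.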

With the increase of data in day-to-day life, businesses and other stakeholders need to analyze data to make better predictions. Traditionally, relational data has been a source of various insights, but with the increase in computational power and the need to understand deeper relationships between entities, new techniques are required. For this purpose, graph data analysis has become a powerful tool for understanding data, because it allows more realistic and flexible modelling of complex relationships. Recently, Graph Neural Networks (GNNs) have shown great promise in various applications, such as social network analysis, recommendation systems, drug discovery, and more. However, the data can be subjected to adversarial attacks, whether during training (poisoning attacks) or during testing (evasion attacks), which can adversely manipulate the outcome of the GNN model. Therefore, it is crucial to make GNNs robust to such attacks. Existing robustness methods are computationally demanding and perform poorly as the intensity of the attack increases. This paper presents a computationally efficient framework, namely pLAPGNN, based on the weighted p-Laplacian for making GNNs robust. Empirical evaluation on real datasets establishes the efficacy and efficiency of the proposed method.

Enhancing Robustness of Graph Neural Networks through p-Laplacian

Many real-world systems, from neural circuits to economic networks, exhibit feedback loops that are best represented as directed cyclic graphs (DCGs). Yet most scalable causal discovery methods either impose hard acyclicity or rely on global backpropagation, making them unsuitable for feedback-rich settings. We propose PreCyc, a predictive coding framework for causal structure learning that combines node-wise energy minimisation with a soft acyclicity surrogate and sparsity regularisation. The algorithm alternates local state inference and weight updates, avoiding reverse-mode differentiation while remaining scalable to larger graphs. Our analysis shows convergence to a stationary point under standard smoothness assumptions, and we clarify the distinction between local error signals for data fit and the global nature of acyclicity enforcement. Experiments on synthetic Erdős–Rényi, Watts–Strogatz, and scale-free graphs, as well as the 279-node C. elegans connectome, demonstrate competitive performance in both structure recovery and cycle identification compared with state-of-the-art cyclic causal discovery methods. While the current implementation focuses on linear structural equation models with observational equilibrium data, PreCyc establishes predictive coding as a principled and scalable foundation for causal discovery in feedback-rich systems.

Predictive Coding Causal Discovery for Directed Cyclic Graphs

Prior work on node classification shows that Graph Neural Networks (GNNs) can learn transferable representations of graph properties when those properties are consistent across graphs. For a fixed graph, one would then expect GNNs trained for link prediction to learn a representation consistent with that learnt for node classification. We show this intuition does not hold in the general case. Instead, we find that popular link prediction models can learn a trivial mini-batch dependent heuristic, enabled by batch normalisation layers, to solve the edge classification task. When correcting for this, we observe increased alignment of the network representation with node-class relevant features, suggesting the network has learnt a graph representation that better aligns with the underlying graph's properties. Our findings suggest that standard link prediction training may be leading us to overestimate link predictors' ability to learn a generalised representation of a graph that is consistent across tasks.

Mini-Batch Class Composition Bias in Link Prediction

Knowledge graph completion aims to infer unknown information in a knowledge graph that is incomplete, due to noisy or missing data. Geographic knowledge graphs, which are typically derived from crowd-sourced data, are often incomplete, making geographic knowledge graph completion an important problem. Most current methods for knowledge graph completion are generic, and do not account for the spatial nature of geographic knowledge graphs. The few methods that are tailored to geographic knowledge graphs are computationally expensive or are designed for a closed-world setting, which is not practical in the geography domain. We study this problem by evaluating existing state-of-the-art standard and geo-specific knowledge graph completion methods on a large dataset of geographic knowledge graphs. Our findings reveal that these methods perform poorly, leaving an open problem for the AI and graphs community. To aid in further research, we suggest some possible areas of work that we believe could lead to fruitful developments for this problem.

An Experimental Analysis of Geographic Knowledge Graph Completion Methods

Chain-of-thought (CoT) prompting enables Large Language Models to solve complex problems, but deploying these models safely requires reliable confidence estimates, a capability where existing methods suffer from poor calibration and severe overconfidence on incorrect predictions. We propose Enhanced Dirichlet+Topology Risk (EDTR), a novel decoding strategy that combines topological analysis with Dirichlet-based uncertainty quantification to measure LLM confidence across multiple reasoning paths. EDTR treats each CoT as a vector in high-dimensional space and extracts eight topological risk features capturing the geometric structure of reasoning distributions: tighter, more coherent clusters indicate higher confidence while dispersed, inconsistent paths signal uncertainty. We evaluate EDTR against three state-of-the-art calibration methods across four diverse reasoning benchmarks spanning olympiad-level mathematics (AIME), grade school math (GSM8K), commonsense reasoning, and stock price prediction. EDTR achieves 41% better calibration than competing methods with an average ECE of 0.287 and the best overall composite score of 0.672, while notably achieving perfect accuracy on AIME and exceptional calibration on GSM8K with an ECE of 0.107, both domains where baselines exhibit severe overconfidence. Our work provides a geometric framework for understanding and quantifying uncertainty in multi-step LLM reasoning, enabling more reliable deployment where calibrated confidence estimates are essential.

Optimizing Chain-of-Thought Confidence via Topological and Dirichlet Risk Analysis
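The calibration results above are reported as ECE (expected calibration error). As a reference point, the following is a minimal sketch of the common equal-width-bin estimator: the gap between average confidence and accuracy in each bin, weighted by bin size. The bin count of 10 and the binning convention are assumptions; the abstract does not state which variant the authors use.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width confidence bins over [0, 1].

    confidences: predicted confidence per example, each in [0, 1]
    correct:     whether each prediction was right (bool or 0/1)
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0

    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, 1.0 if ok else 0.0))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(a for _, a in members) / len(members)
        # Weight each bin's |accuracy - confidence| gap by its share of examples.
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece
```

A model that reports 0.9 confidence on four answers but gets only three right has a gap of about 0.15 in that bin; lower ECE means stated confidence tracks observed accuracy more closely.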

Explainable AI (XAI) methods aiming to probe model internals for scientific discovery ("RED XAI") must move beyond correlational saliency maps. We address this by presenting a systematic comparison of segmentation methods within a causal attribution framework. We contrast an object-aware approach using the Segment Anything Model (SAM) against a texture-aware baseline using SLIC superpixels. Both are integrated into a pipeline utilizing Grad-CAM for saliency, CLIP for concept labeling, and a causal validation step quantifying concept importance via counterfactual interventions (blur masking) measured by raw confidence drop. Evaluating on 200 ImageNet images, we uncover a critical sensitivity-reliability trade-off: SAM-based object-centric concepts show significantly higher average causal impact (81.0% mean confidence drop vs. 37.7% for SLIC), demonstrating greater sensitivity, but suffer from segmentation failures in 9.5% of cases (181/200 successes). SLIC achieves 100% reliability (200/200 successes) and lower impact variance, albeit with reduced sensitivity. This trade-off provides actionable guidance for domain scientists: SLIC's robustness is preferable for high-stakes, texture-reliant tasks (e.g., medical diagnostics), while SAM's sensitivity may benefit exploratory analysis of object-centric phenomena. Our work offers quantitative evidence of this trade-off, enabling more informed XAI method selection for reliable scientific insight.

Causal Quantification of the Sensitivity-Reliability Trade-Off in Semantic XAI: Comparing Object-Aware (SAM) and Texture-Aware (SLIC) Segmentation

The n-body problem, fundamental to astrophysics, describes the motion of n bodies under their mutual gravitational interactions. Traditional machine learning models used for predicting and forecasting trajectories are often data-intensive “black box” models that ignore physical laws and therefore lack interpretability. In contrast, Scientific Machine Learning (Scientific ML) directly embeds the known physical laws into the machine learning framework. Through robust modelling in the Julia programming language, our method uses two Scientific ML frameworks, Neural Ordinary Differential Equations (NODEs) and Universal Differential Equations (UDEs), to predict and forecast the system’s dynamics. In addition, an essential component of our analysis involves determining the “forecasting breakdown point”, which is the smallest amount of training data our models need to predict future, unseen data accurately. We employ synthetically created noisy data to simulate real-world observational limitations. Our findings indicate that the UDE model is much more data efficient, needing only 20% of the data for a correct forecast, whereas the Neural ODE requires 90%.

Forecasting N-Body Dynamics: A Comparative Study of Neural Ordinary Differential Equations and Universal Differential Equations
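For readers unfamiliar with the underlying physics, the sketch below computes Newtonian gravitational accelerations for n bodies and advances them with one leapfrog step. It covers only the classical equations of motion that such a study would start from; it is not the authors' Neural ODE or UDE model, which the abstract says was built in Julia. The function names, the unit choice `G=1.0`, and the optional `softening` parameter are assumptions for illustration.

```python
def accelerations(positions, masses, G=1.0, softening=0.0):
    """Newtonian gravitational acceleration on each body.

    positions: list of coordinate lists (2D or 3D), one per body
    masses:    list of masses, one per body
    G:         gravitational constant (1.0 assumes rescaled units)
    softening: optional length added in quadrature to avoid singular forces
    """
    n = len(positions)
    dim = len(positions[0])
    acc = [[0.0] * dim for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            delta = [positions[j][k] - positions[i][k] for k in range(dim)]
            dist2 = sum(d * d for d in delta) + softening ** 2
            inv_d3 = dist2 ** -1.5
            # a_i += G * m_j * (r_j - r_i) / |r_j - r_i|^3
            for k in range(dim):
                acc[i][k] += G * masses[j] * delta[k] * inv_d3
    return acc


def leapfrog_step(positions, velocities, masses, dt, G=1.0):
    """Advance one kick-drift-kick leapfrog step; returns (positions, velocities)."""
    acc = accelerations(positions, masses, G)
    # Half kick: update velocities by half a time step.
    half_v = [[v[k] + 0.5 * dt * a[k] for k in range(len(v))]
              for v, a in zip(velocities, acc)]
    # Drift: update positions by a full time step.
    new_pos = [[p[k] + dt * hv[k] for k in range(len(p))]
               for p, hv in zip(positions, half_v)]
    # Second half kick using accelerations at the new positions.
    new_acc = accelerations(new_pos, masses, G)
    new_vel = [[hv[k] + 0.5 * dt * a[k] for k in range(len(hv))]
               for hv, a in zip(half_v, new_acc)]
    return new_pos, new_vel
```

Repeated calls to `leapfrog_step` produce trajectories of the kind that could serve as synthetic training data; in a UDE-style setup, part of the right-hand side computed by `accelerations` would be replaced or augmented by a learned term.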

Diffusion Transformers (DiTs) have recently replaced U-Net backbones as the dominant architecture in state-of-the-art text-to-image generative models, achieving remarkable visual fidelity. However, their internal mechanisms remain largely unexplored. In this work, we investigate the emergence of high-norm activations within DiTs—tokens with unusually large magnitudes that resemble the “outlier” tokens previously identified in Vision Transformers (ViTs). Through a systematic analysis of four DiT architectures, we find that only Flux-Schnell and Pixart-sigma exhibit such activations in the image stream, primarily concentrated in the central transformer layers. Using linear probes and qualitative ablations, we show that, unlike ViT outliers, these activations do not encode global or semantic image information and their removal has negligible effect on the generation process. We refer to these as sink registers, reflecting their passive, non-semantic role. Our findings highlight an architectural divergence between ViTs and DiTs, and contribute to a deeper interpretability of diffusion-based generative models.
