Singapore

Existing stereotype auditing methods for large language models (LLMs) typically rely on isolated rating schemes or task-specific probes, lacking theoretical grounding and failing to reveal the internal organization beyond surface-level output patterns. In this paper, we introduce SCoUT (Stereotype Content oriented Utility structure via Thurstonian modeling), a closed-loop framework that structurally models, explicitly probes, and causally intervenes on stereotype dimensions (warmth and competence) in LLMs. SCoUT first reconstructs a global stereotype utility structure aligned with Stereotype Content Model theory via Thurstonian comparative judgments. Across multiple open-source LLMs, this modeling achieves high pairwise-preference prediction accuracy ($\ge0.90$ on larger-scale models) and exhibits strong cross-model consistency. Probing internal attention mechanisms localizes this structure to specific heads (Spearman’s $\rho$ up to 0.83 for warmth and 0.90 for competence) and surfaces a salient asymmetry between warmth and competence. Further, targeted inference-time activation modifications on these dimension-sensitive heads consistently steer model outputs along the intended axes. By bridging behavioral measurement with internal representation and controllable steering, SCoUT offers an end-to-end framework that uncovers and interprets the latent structure of stereotypes, advancing stereotype auditing from surface detection to structural analysis.

AAAI 2026

SCoUT: A Framework for Structured Stereotype Analysis in Language Models

nlp: ethics -- bias, fairness, transparency & privacy

nlp: (large) language models

nlp: interpretability, analysis, and evaluation of nlp models

poster
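The Thurstonian comparative-judgment step named in the SCoUT abstract can be illustrated with a minimal sketch. It assumes the classic Thurstone Case V model (independent Gaussian utilities with equal variance); the abstract does not say which variant SCoUT uses, and the item labels, utility values, and function names below are hypothetical.

```python
import math


def preference_probability(mu_i, mu_j, sigma=1.0):
    """P(item i is preferred over item j) under Thurstone Case V.

    The utility difference is Gaussian with mean mu_i - mu_j and
    variance 2 * sigma**2, so the probability is a normal CDF.
    """
    z = (mu_i - mu_j) / (sigma * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def pairwise_accuracy(utilities, comparisons):
    """Fraction of observed (winner, loser) pairs the utilities predict."""
    correct = sum(
        1
        for winner, loser in comparisons
        if preference_probability(utilities[winner], utilities[loser]) > 0.5
    )
    return correct / len(comparisons)


# Hypothetical fitted utilities and observed pairwise judgments.
utilities = {"item_a": 1.2, "item_b": 0.3, "item_c": -0.5}
comparisons = [
    ("item_a", "item_b"),
    ("item_a", "item_c"),
    ("item_b", "item_c"),
    ("item_c", "item_a"),  # a judgment that disagrees with the utilities
]
accuracy = pairwise_accuracy(utilities, comparisons)
```

Here `accuracy` plays the role of the pairwise-preference prediction accuracy reported in the abstract: the share of held-out comparisons whose winner the fitted utilities rank higher.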

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.



High-quality long-context data is essential for training large language models (LLMs) capable of processing extensive documents, yet existing synthesis approaches using relevance-based aggregation face challenges of computational efficiency. We present LiteLong, a resource-efficient method for synthesizing long-context data through structured topic organization and multi-agent debate. Our approach leverages the BISAC book classification system to provide a comprehensive hierarchical topic organization, and then employs a debate mechanism with multiple LLMs to generate diverse, high-quality topics within this structure. For each topic, we use lightweight BM25 retrieval to obtain relevant documents and concatenate them into 128K-token training samples. Experiments on HELMET and Ruler benchmarks demonstrate that LiteLong achieves competitive long-context performance and can seamlessly integrate with other long-dependency enhancement methods. LiteLong makes high-quality long-context data synthesis more accessible by reducing both computational and data engineering costs, facilitating further research in long-context language training.

LiteLong: Resource-Efficient Long-Context Data Synthesis for LLMs
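The retrieve-and-concatenate step described in the LiteLong abstract can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: it assumes whitespace tokenization as a stand-in for model tokens, uses the Okapi BM25 formula with a smoothed non-negative IDF, and a tiny `token_budget` in place of the 128K-token target. The function names and the example corpus are hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with Okapi BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each term once per document

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            n_t = doc_freq[term]
            idf = math.log((n_docs - n_t + 0.5) / (n_t + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def build_sample(topic, docs, token_budget):
    """Concatenate the best-matching documents until the budget is used."""
    scores = bm25_scores(topic, docs)
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    picked, used = [], 0
    for i in ranked:
        length = len(docs[i].split())
        if scores[i] <= 0 or used + length > token_budget:
            continue  # skip irrelevant documents and ones that do not fit
        picked.append(docs[i])
        used += length
    return "\n\n".join(picked)


docs = [
    "sourdough bread baking at home",
    "tax law for small businesses",
    "baking bread with whole wheat flour",
    "history of the roman empire",
]
sample = build_sample("bread baking", docs, token_budget=12)
```

With this corpus, `sample` contains only the two bread-related documents, ordered by BM25 score.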

LiDAR odometry is a critical component of SLAM in autonomous driving and robotics. Learning-based methods have shown remarkable performance by regressing relative poses in an end-to-end manner. However, when applying these trained models, originally developed on the widely used KITTI dataset, to other scenes, performance often drops significantly. In other words, existing methods struggle to generalize well to new environments. To address this challenge, we propose RCP-LO, a simple yet effective LiDAR odometry framework. 
We introduce a novel representation for relative poses, reformulating them as relative coordinates, which can then be solved using geometrical verification. This approach avoids overly simplified pose representations and makes better use of scene geometry, thereby improving generalization.
Moreover, to capture the inherent uncertainties in relative pose estimation from occluded LiDAR point clouds from dynamic environments, we adapt our framework to learn a denoising diffusion model, allowing for sampling plausible relative coordinates while improving robustness. We also introduce a differentiable geometric weighted singular value decomposition module, enabling efficient pose estimation through a single forward pass. 
Extensive experiments demonstrate that RCP-LO, trained exclusively on the KITTI dataset, achieves competitive performance compared to SOTA learning-based methods and generalizes effectively to the KITTI-360, Ford, and Oxford datasets. Our code will be made available upon acceptance.

RCP-LO: A Relative Coordinate Prediction Framework for Generalizable Deep LiDAR Odometry

Knowledge Graph Embedding (KGE) aims to map entities and relationships into a continuous vector space to facilitate reasoning and downstream tasks. Although previous KGE methods based on Euclidean, complex, or hyperbolic spaces have performed well, they still struggle to effectively model Z-Paradox relation patterns, which account for a large proportion of each knowledge graph. To address this issue, we propose a novel KGE method, **FlorE**, which integrates the full Lorentz Group and a directional offset operation in hyperbolic space for the KGE task. Specifically, we incorporate the full Lorentz Group to enable the same relation in a knowledge graph (KG) to perform indefinite isometry, thus avoiding the overlapping of entities. Meanwhile, we implement the directional offset operation via exponential mapping to transform the relations to the same Lorentz manifold as the entities, thus maintaining geometric consistency between the relations and entities in the KG. By integrating these two techniques, FlorE can effectively model the Z-Paradox relation patterns and improve the representation learning ability for KGs. Experiments on five benchmark datasets demonstrate that our method achieves state-of-the-art performance. For the Z-Paradox relation patterns, the improvements reach **26.7%**, **15.6%**, **35.4%**, **33.7%**, and **31.5%** on FB15k-237, WN18RR, CoDEx-S, CoDEx-M and CoDEx-L, respectively.

FlorE: Integrating Full Lorentz Group and Directional Offsets for Effective Knowledge Graph Embedding
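The FlorE abstract relies on the exponential map to place relations on the same Lorentz manifold as the entities. The sketch below shows only the standard exponential map on the unit hyperboloid, assuming curvature -1. It is not FlorE's directional offset operation, whose details the abstract does not give, and the function names and example vectors are hypothetical.

```python
import math


def lorentz_inner(x, y):
    """Lorentzian inner product: -x0*y0 + x1*y1 + ... + xn*yn."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))


def exp_map(x, v):
    """Map tangent vector v at point x onto the unit hyperboloid.

    Assumes <x, x>_L = -1 and <x, v>_L = 0 (v is tangent at x).
    """
    norm_v = math.sqrt(max(lorentz_inner(v, v), 0.0))
    if norm_v < 1e-12:
        return list(x)  # a zero tangent vector leaves the point unchanged
    return [
        math.cosh(norm_v) * xi + math.sinh(norm_v) * vi / norm_v
        for xi, vi in zip(x, v)
    ]


origin = [1.0, 0.0, 0.0]    # base point of the hyperboloid
tangent = [0.0, 0.3, -0.4]  # tangent at the origin, Lorentz norm 0.5
point = exp_map(origin, tangent)
on_manifold = abs(lorentz_inner(point, point) + 1.0) < 1e-9
```

The check `on_manifold` confirms that the mapped point still satisfies the hyperboloid constraint, which is the geometric consistency property the abstract refers to.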

Branch-and-bound (B&B) is a fundamental algorithmic framework for solving Mixed-Integer Linear Programming (MILP) problems, where branching decisions critically affect solver efficiency. Recent learning-based methods apply imitation learning to select branching variables, but their deterministic predictions limit exploration and generalization. In this paper, we propose a novel framework that formulates branching variable selection as a conditional generative process, exploring deep-level decision features. Our approach leverages diffusion models to enable diverse and exploratory branching score generation, while consistency modeling distills this process into efficient one-step inference conditioned on the B&B state. This design allows our method to achieve both high-quality and fast branching decisions, significantly improving the overall performance of branch-and-bound solvers. Extensive experiments on challenging cross-scale and cross-category benchmarks demonstrate that our framework consistently outperforms state-of-the-art imitation learning baselines, delivering substantial improvements in solution quality, computational efficiency, and inference speed.

Generative Branching for Mixed-Integer Linear Programming

Anomaly detection in dynamic graphs is a critical area of research that focuses on identifying abnormal components within evolving graph structures that deviate significantly from typical patterns. Despite advancements in traditional temporal pattern mining and deep learning techniques, a comprehensive benchmarking framework for Dynamic Graph Anomaly Detection (DyGAD) has been lacking. To address this gap, we introduce **BAG**, the first comprehensive benchmark specifically designed for anomaly detection on dynamic graphs. BAG enables extensive evaluation of 25 leading DyGAD models, covering both classical approaches and advanced Dynamic Graph Neural Networks (DGNNs), across 10 diverse real-world datasets that include both synthetic and naturally occurring anomalies. The framework supports evaluations at both the edge and node levels, offering a robust tool to advance DyGAD research. Our main finding is that Continuous-time Dynamic Graph (CTDG) models demonstrate superior performance and potential in detecting anomalies in dynamic graph edges, compared to Discrete-time Dynamic Graph (DTDG) models. Furthermore, the results reveal that existing methods are less effective at detecting organic anomalies, primarily due to the presence of temporal anomalies and highly imbalanced samples. The proposed BAG benchmark significantly enhances the evaluation of DyGAD methods by improving dataset selection, metric application, and model training. Moreover, BAG supports reproducibility and further exploration in this field by integrating all models, datasets, and evaluation protocols into an open-source repository at https://github.com/opensource-cmd/BAG.

BAG: Benchmarking Anomaly Detection on Dynamic Graphs

Large Language Models (LLMs) and causal learning each hold strong potential for clinical decision making (CDM). However, their synergy remains poorly understood, largely due to the lack of systematic benchmarks evaluating their integration in clinical risk prediction. In real-world healthcare, identifying features with causal influence on outcomes is crucial for actionable and trustworthy predictions. While recent work highlights LLMs' emerging causal reasoning abilities, comprehensive benchmarks that assess their causal learning and their performance when informed by causal features in clinical risk prediction are lacking. To address this, we introduce REACT-LLM, a benchmark designed to evaluate whether combining LLMs with causal features can enhance clinical prognostic performance and potentially outperform traditional machine learning (ML) methods. Unlike existing LLM-clinical benchmarks that often focus on a limited set of outcomes, REACT-LLM evaluates 7 clinical outcomes across 2 real-world datasets, comparing 15 prominent LLMs, 6 traditional ML models, and 3 causal discovery (CD) algorithms. Our findings indicate that while LLMs perform reasonably in clinical prognostics, they have not yet outperformed traditional ML models. Integrating causal features derived from CD algorithms into LLMs offers limited performance gains, primarily due to the strict assumptions of many CD methods, which are often violated in complex clinical data. While the direct integration yields limited improvement, our benchmark reveals a more promising synergy: LLMs serve effectively as knowledge-rich collaborators for identifying and optimizing causal features. Additionally, in-context learning improves LLM predictions when prompts are tailored to the task and model. Different LLMs show varying sensitivity to structured data encoding formats; for example, open-source models perform better with JSON, while smaller models benefit from narrative serialization.
These findings highlight the need to match prompts and data formats to model architecture and pretraining. Our code is publicly available at: https://github.com/LinnaWang-Lena/REACT_LLM.

REACT-LLM: A Benchmark for Evaluating LLM Integration with Causal Features in Clinical Prognostic Tasks

Gait recognition has emerged as a promising biometric technique for long-distance and non-intrusive human identification. While Transformers have revolutionized vision tasks, their adaptation to gait recognition remains underexplored due to domain-specific challenges such as sparse silhouette modality, spatial-temporal dynamics, fine-grained motion cues, and limited training data. In this paper, we propose Gait Transformer (GaT), an end-to-end Transformer backbone specifically tailored for silhouette-based gait recognition. GaT introduces three key components: (1) a hybrid patch embedding module that combines convolutional stems with group-batch normalization to enhance structural preservation; (2) a decomposed token mixer that explicitly models both short-range and long-range dependencies across spatial-temporal dimensions; and (3) a hybrid positional encoding strategy that integrates absolute, relative, and rotary embeddings to support efficient training under data scarcity. Without relying on any pretraining, GaT achieves state-of-the-art performance on Gait3D, GREW, and CCGR-MINI.

Gait Transformer: End-to-End Transformer Backbone for Gait Recognition

Diffusion models have shown strong potential in all-in-one image restoration (AiOIR), excelling at generating rich texture details. Existing AiOIR methods either retrain a diffusion model or fine-tune a pretrained diffusion model with extra conditional guidance. However, they often suffer from high inference costs and limited adaptability to diverse degradation types. In this paper, we propose an efficient AiOIR method, Diffusion Once and Done (DOD), which aims to achieve superior restoration performance with only one-step sampling of Stable Diffusion (SD) models. Specifically, multi-degradation feature modulation is first introduced to capture different degradation prompts with a pretrained diffusion model. Then, parameter-efficient conditional low-rank adaptation integrates the prompts to enable fine-tuning of the SD model to adapt to different degradation types. In addition, a high-fidelity detail enhancement module is integrated into the decoder of SD to improve structural and textural details. Experiments demonstrate that our method outperforms existing diffusion-based restoration approaches in both visual quality and inference efficiency.

Diffusion Once and Done: Degradation-Aware LoRA for All-in-One Image Restoration

Magnetic Particle Imaging (MPI) is an innovative medical modality, providing nanomolar-scale in vivo sensitivity and radiation-free dynamic real-time detection for precision medicine. However, MPI faces a challenging problem in accurately visualizing nanoparticle distributions: images reconstructed from unidirectional scanning exhibit anisotropy. The anisotropy in spatial resolution leads to distortion and blurred image boundaries. Existing deep learning methods for anisotropy calibration are limited to simulation data because real-world MPI datasets are lacking. To address these problems, we spent over three years designing and constructing a real-world MPI anisotropic image dataset (20,156 images) with diverse phantoms (sensitivity, resolution, vessel, shape) and animal scanning. Then, we introduce a novel Mamba-based method, MPI-Mamba, for anisotropic image calibration. Specifically, we propose a latent feature fusion state space model (LFF-SSM) block for feature fusion and leverage a conditional latent diffusion model (CL-DM) branch for feature extraction. The CL-DM extracts latent features in a highly compressed latent space to guide the calibration and deblurring process. Next, we exploit the LFF-SSM to fully fuse the extracted multi-scale features and capture contextual information from the image structure, enabling the model to learn the overall distribution of signal concentration. We evaluate our method and competing methods on a simulation dataset and on our diverse real-world MPI dataset. The results show that our proposed approach outperforms competing methods for anisotropic image calibration and deblurring. Source code and the real-world MPI dataset will be made available upon acceptance.

MPI-Mamba: Latent Feature Fusion Mamba for Anisotropic Image Calibration and Deblurring in Magnetic Particle Imaging

Long-context inference in large language models (LLMs) is increasingly constrained by the KV cache bottleneck: memory usage grows linearly with sequence length, while attention computation scales quadratically. Existing approaches address this issue by compressing the KV cache along the temporal axis through strategies such as token eviction or merging to reduce memory and computational overhead. However, these methods often neglect fine-grained importance variations across feature dimensions (i.e., the channel axis), thereby limiting their ability to effectively balance efficiency and model accuracy. In practice, we observe that channel saliency varies dramatically across both queries and positions: certain feature channels carry near-zero information for a given query, while others spike in relevance. To address this oversight, we propose SPARK, a training-free plug-and-play method that applies unstructured sparsity by pruning KV at the channel level, while dynamically restoring the pruned entries during attention score computation. Notably, our approach is orthogonal to existing KV compression and quantization techniques, so it can be combined with them to achieve further acceleration. By reducing channel-level redundancy, SPARK enables processing of longer sequences within the same memory budget. For sequences of equal length, SPARK not only preserves or improves model accuracy but also reduces KV cache storage by over 30% compared to eviction-based methods. Furthermore, even at an aggressive pruning ratio of 80%, SPARK keeps performance degradation below 5% relative to the baseline eviction method, demonstrating robustness and effectiveness. Our code will be available at https://github.com/AMD-AIG-AIMA/AMD-Spark.
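The idea of channel-level KV pruning in the SPARK abstract can be illustrated with a toy sketch. It swaps in a simple magnitude-based rule (keep the largest-magnitude channels of each cached key vector) and leaves pruned channels at zero. SPARK's actual saliency criterion and its restoration of pruned entries are not specified in the abstract and are not reproduced here. All names and values are hypothetical.

```python
def prune_channels(vector, keep_ratio):
    """Zero all but the largest-magnitude channels of one cached key vector."""
    keep = max(1, round(len(vector) * keep_ratio))
    ranked = sorted(
        range(len(vector)), key=lambda c: abs(vector[c]), reverse=True
    )
    kept = set(ranked[:keep])
    return [value if c in kept else 0.0 for c, value in enumerate(vector)]


def dot(a, b):
    """Unnormalized attention score between a query and a key."""
    return sum(x * y for x, y in zip(a, b))


query = [0.5, -1.0, 0.1, 2.0]
key = [0.9, 0.05, -0.02, 1.5]

sparse_key = prune_channels(key, keep_ratio=0.5)  # keeps channels 0 and 3
dense_score = dot(query, key)
sparse_score = dot(query, sparse_key)
```

In this example half of the key's channels are dropped, yet the attention score moves only from about 3.398 to 3.45, because the pruned channels carried little of the dot product.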
