Singapore

Auditing large language models (LLMs) for biases is an ongoing and dynamic process, resembling a proverbial cat-and-mouse game. As researchers identify new vulnerabilities in LLMs, guardrails are updated to address them, prompting the need for innovative approaches to audit the increasingly fortified LLMs for biases. 
This paper makes three contributions. First, it introduces a scalable, explainable framework to measure biases against various identity groups across multiple open large language models. Second, it conducts a bias audit considering five well-known open LLMs and demonstrates their bias inclinations towards several historically disadvantaged groups. Our audit reveals disturbing antisemitic, Islamophobic, and xenophobic biases present in several well-known LLMs. Finally, we release a dataset of 1,000 probes curated under the supervision of an expert social scientist that can facilitate similar audits.

AAAI 2026

How Can You Tell if Your Large Language Model Could Be a Closet Antisemite? An Explainability-Based Audit Framework for Implicit Bias

antisemitism in llm

llm bias audit

responsible ai

technical paper

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.


Embedding deep neural networks (NNs) into mixed-integer programs (MIPs) is attractive for decision making with learned constraints, yet state-of-the-art “monolithic” linearisations blow up in size and quickly become intractable. In this paper, we introduce a novel dual-decomposition framework that relaxes the single coupling equality $u=x$ with an augmented Lagrange multiplier and splits the problem into a vanilla MIP and a constrained NN block. Each part is tackled by the solver that suits it best (branch-and-cut for the MIP subproblem, first-order optimisation for the NN subproblem), so the model remains modular, the number of integer variables never grows with network depth, and the per-iteration cost scales only linearly with the NN size. On the public SurrogateLIB benchmark, our method proves scalable, modular, and adaptable: it runs 120$\times$ faster than an exact Big-M formulation on the largest test case; the NN sub-solver can be swapped from a log-barrier interior step to a projected-gradient routine with no code changes and identical objective value; and swapping the MLP for an LSTM backbone still completes the full optimisation in 47 s without any bespoke adaptation.

Scalable Mixed-Integer Optimization with Neural Constraints via Dual Decomposition
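
The abstract above says only that the coupling equality $u=x$ is relaxed with an augmented Lagrange multiplier and that the problem splits into a MIP block and an NN block. As a rough illustration, the textbook augmented-Lagrangian (ADMM-style) form of such a split is sketched below. The symbols $f$, $g$, $\mathcal{X}$, $\mathcal{U}$, $\rho$ and the update order are assumptions made for this sketch and may differ from the paper's actual formulation.

```latex
\begin{aligned}
\min_{x,\,u}\quad & f(x) + g(u) \quad \text{s.t.}\quad u = x, \\
\mathcal{L}_\rho(x, u, \lambda) &= f(x) + g(u) + \lambda^\top (u - x)
    + \tfrac{\rho}{2}\,\lVert u - x \rVert_2^2, \\
x^{k+1} &\in \arg\min_{x \in \mathcal{X}} \mathcal{L}_\rho(x, u^{k}, \lambda^{k})
    \quad \text{(MIP subproblem, e.g. branch-and-cut)}, \\
u^{k+1} &\in \arg\min_{u \in \mathcal{U}} \mathcal{L}_\rho(x^{k+1}, u, \lambda^{k})
    \quad \text{(NN subproblem, e.g. first-order method)}, \\
\lambda^{k+1} &= \lambda^{k} + \rho\,(u^{k+1} - x^{k+1}).
\end{aligned}
```

In this generic form the integer variables live only in $\mathcal{X}$, so the size of the MIP subproblem does not depend on the network depth, which is consistent with the modularity claim in the abstract.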

Referring Multi-Object Tracking (RMOT) aims to achieve precise object detection and tracking through natural language instructions, representing a fundamental capability for intelligent robotic systems. However, current RMOT research remains mostly confined to ground-level scenarios, which constrains their ability to capture broad-scale scene contexts and perform comprehensive tracking and path planning. In contrast, Unmanned Aerial Vehicles (UAVs) leverage their expansive aerial perspectives and superior maneuverability to enable wide-area surveillance. Moreover, UAVs have emerged as critical platforms for Embodied Intelligence, which has given rise to an unprecedented demand for intelligent aerial systems capable of natural language interaction. 
To this end, we introduce AerialMind, the first large-scale RMOT benchmark in UAV scenarios, which aims to bridge this research gap. To facilitate its construction, we develop an innovative semi-automated collaborative agent-based labeling assistant (COALA) framework that significantly reduces labor costs while maintaining annotation quality. Furthermore, we propose HawkEyeTrack (HETrack), a novel method that collaboratively enhances vision-language representation learning and improves the perception of UAV scenarios. Comprehensive experiments validated the challenging nature of our dataset and the effectiveness of our method. More details about the dataset, development kits, and code can be found in the supplementary materials.

AerialMind: Towards Referring Multi-Object Tracking in UAV Scenarios

Segment Anything Model (SAM) struggles in open-world scenarios with diverse domains. In such settings, naive fine-tuning with a well-designed learning module is inadequate and often causes *catastrophic forgetting* when learning incrementally. To address this issue, we propose a novel continual learning (CL) method for SAM, termed **SAMCL**. Rather than relying on a fixed learning module, our method decomposes incremental knowledge into separate modules and trains a selector to choose the appropriate one during inference. However, this intuitive design introduces two key challenges: ensuring effective module learning and selection, and managing storage as tasks accumulate. To tackle these, we introduce two components: *AugModule* and *Module Selector*. *AugModule* reduces the storage of the popular LoRA learning module by sharing parameters across layers while maintaining accuracy. It also employs heatmaps, generated from point prompts, to further enhance domain adaptation with minimal additional cost. *Module Selector* leverages the observation that SAM’s embeddings can effectively distinguish domains, enabling high selection accuracy by training on low-storage embeddings instead of raw images. Experiments show that **SAMCL** outperforms state-of-the-art methods, achieving only $0.19\%$ forgetting and at least $2.5\%$ gain on unseen domains. Each *AugModule* requires just $0.233$ MB, reducing storage by at least $24.3\%$ over other fine-tuning approaches. The buffer storage for *Module Selector* is further reduced by up to $256\times$.

SAMCL: Empowering SAM to Continually Learn from Dynamic Domains with Extreme Storage Efficiency
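
The abstract above describes a selector that picks one per-domain module from embeddings rather than raw images. The sketch below illustrates that idea with a nearest-centroid rule, which is a stand-in chosen for brevity and not the paper's trained selector. The function names, the toy 2-D embeddings, and the domain labels are all hypothetical.

```python
from typing import Dict, List, Sequence


def fit_centroids(
    embeddings_by_domain: Dict[str, List[Sequence[float]]]
) -> Dict[str, List[float]]:
    """Store one mean embedding per domain (assumed stand-in for a trained selector)."""
    centroids: Dict[str, List[float]] = {}
    for domain, vectors in embeddings_by_domain.items():
        dim = len(vectors[0])
        centroids[domain] = [
            sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)
        ]
    return centroids


def select_module(
    embedding: Sequence[float], centroids: Dict[str, List[float]]
) -> str:
    """Return the domain whose centroid is closest in squared Euclidean distance."""

    def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(centroids, key=lambda d: squared_distance(embedding, centroids[d]))


# Toy usage: two hypothetical domains with well-separated embeddings.
toy_centroids = fit_centroids(
    {
        "medical": [[0.0, 0.0], [0.0, 2.0]],
        "aerial": [[10.0, 10.0], [12.0, 10.0]],
    }
)
chosen = select_module([1.0, 1.0], toy_centroids)
```

Only the centroids need to be stored at inference time here, which mirrors the abstract's point that selecting from embeddings is far cheaper than buffering raw images.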

Hypergraph neural networks (HGNNs) have shown great potential in modeling higher-order relationships among multiple entities. However, most existing HGNNs primarily emphasize low-pass filtering while neglecting the role of high-frequency information. In this work, we present a theoretical investigation into the spectral behavior of HGNNs and prove that combining both low-pass and high-pass components leads to more expressive and effective models. Notably, our analysis highlights that high-pass signals play a crucial role in capturing local discriminative structures within hypergraphs. Guided by these insights, we propose a novel sheaflet-based HGNN that integrates cellular sheaf theory and framelet transforms to preserve higher-order dependencies while enabling multi-scale spectral decomposition. This framework explicitly emphasizes high-pass components, aligning with our theoretical findings. Extensive experiments on benchmark datasets demonstrate the superiority of our approach over existing methods, validating the importance of high-frequency information in hypergraph learning.

High-Pass Matters: Theoretical Insights and Sheaflet-Based Design for Hypergraph Neural Networks
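
To make the low-pass versus high-pass distinction in the abstract above concrete, the sketch below splits a vertex signal using the standard normalized hypergraph Laplacian with unit hyperedge weights. This is a plain spectral split, not the sheaflet construction the paper proposes. The function name and the toy hypergraph are assumptions, and every vertex is assumed to belong to at least one hyperedge.

```python
import math
from typing import List, Tuple


def split_low_high(
    incidence: List[List[int]], signal: List[float]
) -> Tuple[List[float], List[float]]:
    """Split `signal` into low-pass and high-pass parts on a hypergraph.

    incidence[v][e] is 1 if vertex v belongs to hyperedge e, else 0.
    low  = Dv^{-1/2} H De^{-1} H^T Dv^{-1/2} x   (smoothing operator)
    high = x - low                                (normalized Laplacian applied to x)
    """
    n_vertices = len(incidence)
    n_edges = len(incidence[0])
    vertex_deg = [sum(row) for row in incidence]
    edge_deg = [
        sum(incidence[v][e] for v in range(n_vertices)) for e in range(n_edges)
    ]

    # Dv^{-1/2} x
    scaled = [signal[v] / math.sqrt(vertex_deg[v]) for v in range(n_vertices)]
    # De^{-1} H^T (Dv^{-1/2} x): average of the scaled signal inside each hyperedge
    edge_avg = [
        sum(incidence[v][e] * scaled[v] for v in range(n_vertices)) / edge_deg[e]
        for e in range(n_edges)
    ]
    # Dv^{-1/2} H (edge averages): send the hyperedge averages back to vertices
    low = [
        sum(incidence[v][e] * edge_avg[e] for e in range(n_edges))
        / math.sqrt(vertex_deg[v])
        for v in range(n_vertices)
    ]
    high = [x - l for x, l in zip(signal, low)]
    return low, high


# Toy hypergraph: hyperedge 0 = {0, 1, 2}, hyperedge 1 = {2, 3}.
toy_incidence = [[1, 0], [1, 0], [1, 1], [0, 1]]
low_part, high_part = split_low_high(toy_incidence, [1.0, -1.0, 1.0, -1.0])
```

A signal that varies sharply between neighbouring vertices keeps a large high-pass part, while a signal proportional to the square root of the vertex degrees is passed entirely by the low-pass operator.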

Task scheduling has become increasingly critical for embodied AI, where agents need to follow natural language instructions and execute actions efficiently in 3D physical worlds. Existing datasets for task planning in 3D environments often simplify the problem, lacking operations research knowledge for task scheduling and 3D grounding for real-world applications. In this work, we propose Operations Research Knowledge-based 3D Grounded Task Scheduling (OKS3D), a new task that requires the synergy of language understanding, 3D grounding, and efficiency optimization for embodied agents. OKS3D reflects real-world demands by requiring agents to generate efficient, step-by-step schedules that are grounded in 3D space. To facilitate research on OKS3D, we construct a large-scale dataset called OKS3D-60K, comprising 60K tasks across 4K real-world scenes. Furthermore, we propose GRANT, an embodied multi-modal large language model equipped with a simple yet effective scheduling token mechanism to generate efficient task schedules and grounded actions. Extensive experiments on the OKS3D-60K dataset validate the effectiveness of GRANT across language understanding, 3D grounding, and scheduling efficiency. The code and dataset will be released.

Cook and Clean Together: Teaching Embodied Agents for Parallel Task Execution

Self-evaluation, a model's ability to assess the correctness of its own output, is crucial for Large Multimodal Models (LMMs) to achieve self-improvement in multi-turn conversations, yet it is largely absent in foundation models. Recent work has employed reinforcement learning (RL) to enhance self-evaluation; however, its fixed reward mechanism suffers from reward hacking when optimizing multiple training objectives, leading to model collapse. In this paper, we propose AdaPO, an online reinforcement learning framework capable of adaptively adjusting the training objective in real time according to the current training state for each task. Specifically, to mitigate reward hacking, AdaPO introduces an Adaptive Reward Model (ARM) and a Reward Aware Dynamic KL Regularization mechanism. ARM assesses the task's training state from the performance distribution of model-generated multi-turn trajectories. Reward Aware Dynamic KL replaces a fixed penalty with dynamic coefficients that are modulated by the reward gap between different multi-turn situations.
Notably, our method automatically and smoothly adjusts its learning focus based on sub-tasks' training progress without manual intervention.
Extensive experiments over 8 benchmarks and various models show that our method significantly enhances both direct reasoning and self-evaluation capability. We will release our code to contribute to the community.

A Rolling Stone Gathers No Moss: Adaptive Policy Optimization for Stable Self-Evaluation in Large Multimodal Models

Generative recommendation (GR) is an emerging paradigm that tokenizes items into discrete tokens and learns to autoregressively generate the next tokens as predictions. While this token-generation paradigm is expected to surpass traditional transductive methods, potentially generating new items directly based on semantics, we empirically show that GR models predominantly generate items seen during training and struggle to recommend unseen items. In this paper, we propose SpecGR, a plug-and-play framework that enables GR models to recommend new items in an inductive setting. SpecGR uses a *drafter* model with inductive capability to propose candidate items, which may include both existing items and new items. The GR model then acts as a *verifier*, accepting or rejecting candidates while retaining its strong ranking capabilities. We further introduce the guided re-drafting technique to make the proposed candidates more aligned with the outputs of generative recommendation models, improving the verification efficiency. We consider two variants for drafting: (1) using an auxiliary drafter model for better flexibility, or (2) leveraging the GR model's own encoder for parameter-efficient self-drafting. Extensive experiments on three real-world datasets demonstrate that SpecGR exhibits both strong inductive recommendation ability and the best overall performance among the compared methods.

Inductive Generative Recommendation via Retrieval-based Speculation
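
The draft-then-verify loop described in the abstract above can be sketched as follows. This is a minimal illustration under stated assumptions: `draft` and `verify` are hypothetical callables standing in for the drafter and the GR model, and the fixed score threshold and round limit are invented acceptance rules, not the paper's.

```python
from typing import Callable, List, Set, Tuple


def speculative_recommend(
    draft: Callable[[int], List[str]],
    verify: Callable[[str], float],
    k: int,
    threshold: float,
    max_rounds: int = 3,
) -> List[str]:
    """Collect up to k items that the verifier accepts, re-drafting if needed.

    draft(round_index) returns candidate item ids (seen or unseen items).
    verify(item) returns the verifier's score for that item.
    """
    accepted: List[Tuple[float, str]] = []
    already_checked: Set[str] = set()

    for round_index in range(max_rounds):
        for item in draft(round_index):
            if item in already_checked:
                continue  # never score the same candidate twice
            already_checked.add(item)
            score = verify(item)
            if score >= threshold:  # assumed acceptance rule
                accepted.append((score, item))
        if len(accepted) >= k:
            break  # enough accepted candidates, no re-drafting required

    # The verifier's scores decide the final ranking.
    accepted.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in accepted[:k]]


# Toy usage with hypothetical item ids; "new1" and "new2" stand for unseen items.
toy_drafts = [["a", "new1", "b"], ["c", "new2"]]
toy_scores = {"a": 0.9, "new1": 0.7, "b": 0.1, "c": 0.8, "new2": 0.2}
top_items = speculative_recommend(
    draft=lambda r: toy_drafts[r] if r < len(toy_drafts) else [],
    verify=lambda item: toy_scores.get(item, 0.0),
    k=3,
    threshold=0.5,
)
```

Because the drafter proposes candidates and the verifier only scores them, an unseen item such as `new1` can reach the final list without the generative model having to decode it from scratch.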

Alerts generated by Security Operations Centers (SOCs) are often numerous and scattered, requiring significant effort from security analysts to manage, which severely slows response times.
While recent alert correlation graph methods can effectively reduce alert volume, these graphs are often too complex for analysts to understand.
As a result, analysts are increasingly seeking ways to automatically correlate alerts and generate concise, human-readable attack path summaries.
Recently, Large Language Models (LLMs) have demonstrated superior performance owing to their extensive knowledge and advanced reasoning capabilities.
In this work, we propose GARNET, a framework that uses LLMs for reasoning on alert correlation graphs.
GARNET addresses three key technical challenges: 1) modality alignment between alert graphs and logs; 2) semantic alignment between alert graphs and logs; 3) enabling LLM reasoning along graph paths.
Specifically, we first project the embeddings of the graph and logs into the same vector space using contrastive learning. Then, we design self-supervised graph-log instructions to bridge the semantic gap between the graph and logs by training a novel LLM. Finally, GARNET uses a novel Graph-of-Thought (GoT)-based interaction reasoning approach to guide LLM reasoning along graph paths, ultimately generating structured, concise, and human-readable attack path summaries.
Experimental results across six attack scenarios show that GARNET reduces false positives by an average of 80%, lowering the false positive rate to below 0.0037. It outperforms the latest approaches and provides more explainable attribution.

GARNET: GoT-Based Alert Reduction and Narrative Event Tracing
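
The abstract above states that graph and log embeddings are projected into a shared vector space with contrastive learning, but it does not name the loss. The sketch below uses the widely used InfoNCE objective as an assumed example. The function name, the temperature value, and the toy vectors are hypothetical.

```python
import math
from typing import List, Sequence


def info_nce_loss(
    graph_vecs: List[Sequence[float]],
    log_vecs: List[Sequence[float]],
    temperature: float = 0.1,
) -> float:
    """Mean InfoNCE loss where graph_vecs[i] and log_vecs[i] form a positive pair."""

    def normalize(vec: Sequence[float]) -> List[float]:
        norm = math.sqrt(sum(x * x for x in vec))
        return [x / norm for x in vec]

    graphs = [normalize(v) for v in graph_vecs]
    logs = [normalize(v) for v in log_vecs]

    total = 0.0
    for i, g in enumerate(graphs):
        # Cosine similarity of this graph embedding to every log embedding.
        logits = [sum(a * b for a, b in zip(g, l)) / temperature for l in logs]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(z - peak) for z in logits))
        # Negative log-probability of picking the matching log embedding.
        total += log_denominator - logits[i]
    return total / len(graphs)


# Toy usage: matched pairs should score a much lower loss than mismatched ones.
toy_graphs = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce_loss(toy_graphs, [[1.0, 0.0], [0.0, 1.0]])
shuffled_loss = info_nce_loss(toy_graphs, [[0.0, 1.0], [1.0, 0.0]])
```

Minimizing a loss of this kind pulls each alert graph embedding toward its own log embedding and away from the others, which is the general effect the modality-alignment step aims for.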

Incomplete multi-view clustering (IMVC) aims to group data into meaningful clusters when each sample is only partially observed across multiple views. Most existing methods either rely on imputation strategies that may introduce noise and distort the underlying data distribution, or adopt cross-view alignment techniques that focus on pairwise relationships, often resulting in suboptimal representations and unstable clustering performance. In this paper, we propose **G**eometry-**A**ware **V**ariational **I**nformation **M**aximization for Deep Incomplete Multi-view Clustering (GAVIM), a novel imputation-free variational framework that enables robust and coherent incomplete multi-view clustering. Specifically, GAVIM leverages mutual information maximization to preserve the high mutual information between the available multi-view data and the shared embedding. Moreover, we explicitly retain local geometric consistency within each view-specific latent space under the guidance of an adaptive global supervision signal. Lastly, GAVIM aligns all views simultaneously using a Gramian representation alignment measure, ensuring coherent structure across modalities and promoting unified, semantically meaningful representations. Extensive experiments on five benchmark IMVC datasets with varying levels of view incompleteness demonstrate that GAVIM consistently outperforms state-of-the-art methods in clustering accuracy and representation quality.

Geometry-Aware Variational Information Maximization for Deep Incomplete Multi-view Clustering

Early childhood is a critical stage for cognitive development, involving core skills such as visual perception and reasoning. While multimodal large language models (MLLMs) have made rapid progress in various general-purpose tasks, their ability to support early education remains largely underexplored. Existing research on child-related AI largely centers on modeling language, emotion, or behavior, with limited focus on evaluating cognitive tasks relevant to early learning. To address this gap, we propose ChildBench, a multimodal benchmark designed to assess models on tasks inspired by early childhood cognitive development. It covers five key domains through ten tasks, including spatial reasoning, visual reasoning, visual discrimination, counting skills, and visual tracking. The benchmark includes 4,890 carefully constructed images and 5,346 manually annotated samples, ensuring both diversity and age-appropriate content. We evaluate a range of state-of-the-art (SoTA) open-source and closed-source MLLMs—including GPT-4o, Gemini, and Qwen2.5-VL—on ChildBench. Despite strong performance on other benchmarks, the best 7B-parameter model with LoRA tuning achieves only 52.01% accuracy, far below the 96% achieved by 5-year-old children. These results reveal critical limitations in fine-grained perception and reasoning. We further analyze failure cases and discuss directions for future model development. We release ChildBench and the evaluation code at a public anonymous URL: https://anonymous.4open.science/r/ChildBench-AF78.
