Early childhood is a critical stage for cognitive development, involving core skills such as visual perception and reasoning. While multimodal large language models (MLLMs) have made rapid progress on various general-purpose tasks, their ability to support early education remains largely underexplored. Existing research on child-related AI centers mainly on modeling language, emotion, or behavior, with limited focus on evaluating cognitive tasks relevant to early learning. To address this gap, we propose ChildBench, a multimodal benchmark designed to assess models on tasks inspired by early childhood cognitive development. It covers five key domains (spatial reasoning, visual reasoning, visual discrimination, counting skills, and visual tracking) through ten tasks. The benchmark includes 4,890 carefully constructed images and 5,346 manually annotated samples, ensuring both diversity and age-appropriate content. We evaluate a range of state-of-the-art (SoTA) open-source and closed-source MLLMs, including GPT-4o, Gemini, and Qwen2.5-VL, on ChildBench. Despite strong performance on other benchmarks, the best 7B-parameter model with LoRA tuning achieves only 52.01% accuracy, far below the 96% accuracy achieved by 5-year-old children. These results reveal critical limitations in fine-grained perception and reasoning. We further analyze failure cases and discuss directions for future model development. We release ChildBench and the evaluation code at a public anonymous URL: https://anonymous.4open.science/r/ChildBench-AF78.
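The abstract reports exact-match accuracy of MLLMs on manually annotated image-question-answer samples. The sketch below illustrates what such an evaluation loop might look like; the JSON schema, file names, and model wrapper are hypothetical assumptions, not the released evaluation code (which is available at the anonymous URL above).

```python
# Minimal sketch of an accuracy evaluation over ChildBench-style samples.
# The sample schema ({"image", "question", "answer"}) and the model callback
# are hypothetical; the released benchmark code defines the real interfaces.
import json
from pathlib import Path


def load_samples(annotation_file: str) -> list[dict]:
    """Load annotated samples, each pairing an image with a question and answer."""
    with open(annotation_file, "r", encoding="utf-8") as f:
        return json.load(f)


def evaluate(model_answer_fn, samples: list[dict], image_root: str) -> float:
    """Compute exact-match accuracy of a multimodal model over the samples."""
    correct = 0
    for sample in samples:
        image_path = Path(image_root) / sample["image"]
        prediction = model_answer_fn(image_path, sample["question"])
        if prediction.strip().lower() == sample["answer"].strip().lower():
            correct += 1
    return correct / len(samples)


if __name__ == "__main__":
    # Tiny illustrative run with a dummy sample and a trivial stand-in "model";
    # in practice, model_answer_fn would wrap an MLLM such as GPT-4o or Qwen2.5-VL.
    dummy_samples = [{
        "image": "counting/sample_001.png",
        "question": "How many apples are in the picture?",
        "answer": "3",
    }]
    always_three = lambda image_path, question: "3"
    print(f"Accuracy: {evaluate(always_three, dummy_samples, 'images/'):.2%}")
```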
