Singapore

Scene-level captioning in instructional videos can enhance learning by requiring an understanding of both visual cues and temporal structure. By aligning visual cues with textual guidance, this understanding supports procedural learning and multimodal reasoning, providing a richer context for skill acquisition. However, captions that fail to capture this structure may lack coherence and quality, which can create confusion and ultimately undermine the video's educational intent. To address this gap, we introduce DynaStride, a pipeline to generate coherent, scene-level captions without requiring manual scene segmentation. Using the YouCookII dataset’s scene annotations, DynaStride performs adaptive frame sampling and multimodal windowing to capture key transitions within each scene. It then employs a multimodal chain-of-thought process to produce multiple action-object pairs, which are refined and fused using a dynamic stride window selection algorithm that adaptively balances temporal context and redundancy. The final scene-level caption thus integrates visual semantics and temporal reasoning in a single instructional caption. Empirical evaluations against strong baselines, including VLLaMA3 and GPT-4o, demonstrate consistent gains on both N-gram–based metrics (BLEU, METEOR) and semantic similarity measures (BERTScore, CLIPScore). Qualitative analyses further show that DynaStride produces captions that are more temporally coherent and informative, suggesting a promising direction for improving AI-powered instructional content generation.

AAAI 2026

DynaStride: Dynamic Stride Windowing with MMCoT for Instructional Multi-Scene Captioning

workshop paper

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.


Synthetic data generation offers promise for addressing data scarcity and privacy concerns in educational technology, yet practitioners lack empirical guidance for selecting between traditional resampling techniques and modern deep learning approaches. This study presents the first systematic benchmark comparing these paradigms using a 10,000-record student performance dataset. We evaluate three resampling methods (SMOTE, Bootstrap, Random Oversampling) against three deep learning models (Autoencoder, Variational Autoencoder, Copula-GAN) across multiple dimensions: distributional fidelity (Kolmogorov-Smirnov distance, Jensen-Shannon divergence), machine learning utility (Train-on-Synthetic-Test-on-Real scores), and privacy preservation (Distance to Closest Record). Our findings reveal a fundamental trade-off: resampling methods achieve near-perfect utility (TSTR: 0.997) but completely fail privacy protection (DCR: 0.00), while deep learning models provide strong privacy guarantees (DCR: 1.00) at significant utility cost. Variational Autoencoders emerge as the optimal compromise, maintaining 83.3% predictive performance while ensuring complete privacy protection. We provide actionable recommendations: use traditional resampling for internal development where privacy is controlled, and VAEs for external data sharing where privacy is paramount. This work establishes a foundational benchmark and practical decision framework for synthetic data generation in learning analytics.
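As a rough illustration of two of the metrics named above, the following sketch computes a two-sample Kolmogorov-Smirnov distance for one numeric column and a raw Distance to Closest Record (DCR) for each synthetic row. The function names and the toy rows are assumptions made for this example; the abstract does not say how the study normalizes DCR to the 0.00 to 1.00 range it reports, so the values here are unnormalized Euclidean distances.

```python
import bisect
import math


def ks_distance(real, synthetic):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    real, synthetic = sorted(real), sorted(synthetic)

    def ecdf(sample, x):
        # Fraction of the (sorted) sample that is <= x.
        return bisect.bisect_right(sample, x) / len(sample)

    points = sorted(set(real) | set(synthetic))
    return max(abs(ecdf(real, x) - ecdf(synthetic, x)) for x in points)


def distance_to_closest_record(real_rows, synthetic_rows):
    """For each synthetic row, the Euclidean distance to its nearest real row."""
    return [min(math.dist(s, r) for r in real_rows) for s in synthetic_rows]


# Toy data (hypothetical): a resampled row copies a real record exactly,
# so its DCR is zero, which is the privacy failure the abstract describes.
real_rows = [(0.0, 0.0), (1.0, 1.0)]
resampled_rows = [(1.0, 1.0)]
generated_rows = [(1.0, 2.0)]
print(distance_to_closest_record(real_rows, resampled_rows))
print(distance_to_closest_record(real_rows, generated_rows))
```

A DCR of zero means a synthetic row is identical to some real record, which is why pure resampling offers no privacy protection under this metric.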

Synthetic Data in Education: Empirical Insights from Traditional Resampling and Deep Generative Models

Teachable agents powered by AI offer a promising approach to enhance the engagement and understanding of middle school students, as they often face challenges in grasping mathematical concepts and procedures. However, during student-AI interactions, it is essential to determine when the agent should stop to effectively regulate the cognitive and emotional states of the students, which are factors closely linked to positive participation and productive tutoring strategies. This raises a key question: who should decide when the teachable agent stops or continues? Empirical evidence highlights the advantages of using LLM-as-a-judge and knowledge-graph–based decision-making, as well as the potential benefits of fixed-turn conversations. Drawing on 64,060 messages from 7,991 conversations across four randomly assigned stopping mechanisms (8 turns, 16 turns, standalone LLM-as-a-judge, and agent decisions with knowledge graphs, or KGs), our experimental results show that (a) agent decisions informed by KGs were most effective at detecting when conversations should continue, sustaining learner engagement with the teachable agent; (b) off-topic utterances occurred more frequently under fixed-turn conditions; and (c) tutoring strategies such as questioning for help or evaluation and elaborating with justification were most prevalent when agents used KGs for decision-making. Practical implications for designing effective agent-student interactions are discussed.

When to Stop? An Experimental Study on AI Teachable Agent Stopping Mechanisms and Their Learning Affordances in Mathematics

Learning-by-teaching has been shown to foster both conceptual understanding and procedural fluency in mathematics. With recent advancements in large language models (LLMs), students can now assume the roles of tutors, and AI-powered teachable agents can serve as tutees. This study examines individual learning differences in middle school mathematics when interacting with an AI teachable agent, focusing on how students’ individual characteristics and their effective use of tutoring strategies shape learning outcomes, bridging a gap in the literature. Data from 192 students were analyzed using linear mixed models and lag sequential analysis (LSA). Results suggest that baseline mathematics achievement and English Language Arts (ELA) literacy significantly predicted performance on a high-stakes math assessment, while the effects of tutoring strategies varied with respect to gender and socioeconomic status. LSA further revealed (a) more frequent self-reflection transitions and off-topic messages among male learners, and (b) associations between socioeconomic status and learners’ regulation and adaptation of tutoring strategies during agent-mediated interactions. Implications for adaptive AI tutoring design and future research directions are discussed.

Does AI-Powered Teachable Agent Benefit Students Evenly? A Look at the Relationship between Individual Characteristics and State Mathematical Exam Results

Large Language Models (LLMs) like LLaMA, Mistral, and Gemma are increasingly used in decision-critical domains such as healthcare, law, and finance, yet their reliability remains uncertain. They often make overconfident errors, degrade under input shifts, and lack clear uncertainty estimates. Existing evaluations are fragmented, addressing only isolated aspects. We introduce the Composite Reliability Score (CRS), a unified framework that integrates calibration, robustness, and uncertainty quantification into a single interpretable metric. Through experiments on ten leading open-source LLMs across five QA datasets, we assess performance under baselines, perturbations, and calibration methods. CRS delivers stable model rankings, uncovers hidden failure modes missed by single metrics, and highlights that the most dependable systems balance accuracy, robustness, and calibrated uncertainty.

Beyond Hallucinations: A Composite Score for Measuring Reliability in Open-Source Large Language Models

Large Language Models (LLMs) are increasingly deployed in real-world applications where users engage in extended, mixed-topic conversations that depend on prior context. Yet, their reliability under realistic multi-turn interactions remains poorly understood. We conduct a systematic evaluation of conversational reliability through three representative tasks that reflect practical interaction challenges: (1) maintaining global constraints across topic shifts, (2) selecting the correct tool or agent amid interleaved intents, and (3) tracking structured entities under revisions and distractions. Each task pairs single-turn and multi-turn settings, allowing us to quantify reliability degradation under extended dialogue. Across both commercial and open-source models, we observe substantial declines in reliability, particularly for smaller models. Error analyses reveal recurring failure modes such as instruction drift, intent confusion, and contextual overwriting, which compromise dependable behavior in operational systems. Our findings highlight the need for stress-testing LLMs for conversational reliability and developing more robust evaluation methods for trustworthy deployment.

Quantifying Conversational Reliability of Large Language Models under Multi-Turn Interaction

In the context of Visual Question Answering (VQA) and Agentic AI, calibration refers to how closely an AI system's confidence in its answers reflects their actual correctness. While modern VQA systems, powered by advanced vision-language models (VLMs), are increasingly used in high-stakes domains like medical diagnostics and autonomous navigation due to their improved accuracy, the reliability of their confidence estimates remains under-examined. In particular, these systems often produce overconfident responses. To address this, we introduce AlignVQA, a debate-based multi-agent framework in which diverse specialized VLMs, each following a distinct prompting strategy, generate candidate answers and then engage in a two-stage interaction: generalist agents critique, refine, and aggregate these proposals. Furthermore, we introduce a novel differentiable calibration-aware loss function called AlignCal, designed to fine-tune the specialized agents by minimizing an upper bound on the calibration error. This objective explicitly improves the fidelity of each agent’s confidence estimates. Empirical results across multiple benchmark VQA datasets substantiate the efficacy of our approach, demonstrating substantial reductions in calibration discrepancies.
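To make the notion of calibration concrete, here is a minimal sketch of the standard binned Expected Calibration Error (ECE), which compares average confidence with empirical accuracy inside each confidence bin. This is the common textbook metric, not the AlignCal loss, whose exact form the abstract does not give; the function name and bin count are choices made for this example.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean of |avg confidence - accuracy| over bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Confidence 1.0 would index past the last bin, so clamp it.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece


# An overconfident system: 90% stated confidence, 75% actual accuracy.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False]))
```

An overconfident system shows up as bins where average confidence exceeds accuracy; a perfectly calibrated one has an ECE of zero.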

AlignVQA: Debate-Driven Multi-Agent Calibration for Vision Language Models

Foundation models deployed in dynamic domains like robotics and autonomous systems suffer from critical reliability failures, including temporal inconsistencies and vulnerability to sensor noise, stemming from their training on static, disconnected images. To bridge this reliability gap, we propose a lightweight, reliability-aware training paradigm that distills temporal knowledge from video into a standard single-image encoder. By training a predictor to estimate the feature representation of a future frame, our method implicitly forces the backbone model to learn real-world dynamics, enhancing robustness to transient visual artifacts and promoting temporally stable representations. This self-supervised objective instills geometric and physical priors without relying on brittle external modules like optical flow estimators. Remarkably, when pre-trained on only a single, 2-hour uncurated video, our method achieves state-of-the-art among DINO-style approaches on downstream tasks like detection and segmentation, which we use as quantifiable proxies for robust scene understanding. Our work presents a practical and efficient approach for improving the trustworthiness and dependable performance of vision encoders for safe deployment in operational settings.

Next-Frame Prediction as a Reliability-Aware Training Paradigm for Robust Vision Encoders

Large Language Models are increasingly deployed as judges (LaaJ) in code generation pipelines. While attractive for scalability, LaaJs tend to overlook domain-specific issues, raising concerns about their reliability in critical evaluation tasks. To better understand these limitations in practice, we examine LaaJ behavior in a concrete industrial use case: legacy code modernization via COBOL code generation. In this setting, we find that even production-deployed LaaJs can miss domain-critical errors, revealing consistent blind spots in their evaluation capabilities. To better understand these blind spots, we analyze generated COBOL programs and associated LaaJ judgments, drawing on expert knowledge to construct a preliminary taxonomy. Based on this taxonomy, we develop a lightweight analytic checker tool that flags over 30 domain-specific issues observed in practice. We use its outputs as analytic hints, dynamically injecting them into the judge’s prompt to encourage LaaJ to revisit aspects it may have overlooked. Experiments on a test set of 100 programs using four production-level LaaJs show that LaaJ alone detects only about 45% of the errors present in the code (in all judges we tested), while the analytic checker alone lacks explanatory depth. When combined, the LaaJ+Hints configuration achieves up to 94% coverage (for the best-performing judge and injection prompt) and produces qualitatively richer, more accurate explanations, demonstrating that analytic–LLM hybrids can substantially enhance evaluation reliability in deployed pipelines.
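The hint-injection step described above could look roughly like the sketch below, which appends checker findings to a judge prompt before it is sent to the model. The function name, section wording, and the sample hint are hypothetical; the abstract does not specify the actual prompt template or the checker's output format.

```python
def build_judge_prompt(base_prompt, code, hints):
    """Assemble a judge prompt, appending analytic hints when any exist."""
    sections = [base_prompt.strip(), "Code under review:\n" + code.strip()]
    if hints:
        # Each checker finding becomes one bullet the judge is asked to revisit.
        bullets = "\n".join(f"- {hint}" for hint in hints)
        sections.append(
            "Analytic hints (re-check these points before judging):\n" + bullets
        )
    return "\n\n".join(sections)


# Hypothetical usage: one finding from an analytic checker.
prompt = build_judge_prompt(
    "Evaluate the generated COBOL program for correctness.",
    "MOVE A TO B.",
    ["Field B may be shorter than field A (possible truncation)."],
)
print(prompt)
```

Keeping the hints in a separate, clearly labeled section lets the same base prompt be used with and without the checker, which matches the LaaJ versus LaaJ+Hints comparison in the abstract.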

Beyond Blind Spots: Analytic Hints for Mitigating LLM-Based Evaluation Pitfalls

Large Language Models (LLMs) produce strong results but are costly to serve. Static post-training quantization reduces memory and compute, yet uses a single bit width for all prompts, wasting resources on easy inputs and degrading accuracy on harder ones. We introduce Prompt-Adaptive Quantization (PAQ), a per-prompt precision framework that requires no retraining of the underlying model. PAQ trains a lightweight BERT-based router with perplexity-guided supervision to select the smallest adequate quantization level (2, 4, 8, or 16 bits) per input. At inference, prompts are automatically routed to the appropriate pre-quantized LLM variant. Overall, PAQ serves as a novel framework for adaptive per-prompt quantization, reducing latency while maintaining strong accuracy across tasks.
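One way to picture "smallest adequate quantization level" is the sketch below. It is built on two assumptions not confirmed by the abstract: that training labels come from comparing each variant's perplexity to the 16-bit reference within a tolerance, and that the router outputs a per-bit-width adequacy probability. All names, the tolerance, and the threshold are hypothetical.

```python
BIT_WIDTHS = (2, 4, 8, 16)  # quantization levels listed in the abstract


def label_from_perplexity(perplexity_by_bits, tolerance=0.05):
    """Assumed labeling rule: smallest bit width whose perplexity stays
    within `tolerance` (relative) of the 16-bit reference."""
    reference = perplexity_by_bits[16]
    for bits in BIT_WIDTHS:
        if perplexity_by_bits[bits] <= reference * (1 + tolerance):
            return bits
    return BIT_WIDTHS[-1]


def select_bit_width(adequacy_by_bits, threshold=0.5):
    """Assumed routing rule: smallest bit width the router scores as adequate,
    falling back to the highest precision when none clears the threshold."""
    for bits in BIT_WIDTHS:
        if adequacy_by_bits.get(bits, 0.0) >= threshold:
            return bits
    return BIT_WIDTHS[-1]


# Hypothetical prompt: 2-bit hurts perplexity badly, 4-bit is close enough.
print(label_from_perplexity({2: 50.0, 4: 10.2, 8: 10.1, 16: 10.0}))
print(select_bit_width({2: 0.1, 4: 0.7, 8: 0.9, 16: 1.0}))
```

Falling back to 16 bits when no level clears the threshold trades latency for accuracy on prompts the router is unsure about.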

Prompt-Adaptive Quantization: Adaptive Per-Prompt Routing for Efficient LLM Inference

Large language models (LLMs) often generate fluent but factually incorrect statements despite having access to relevant evidence, a failure mode rooted in how they allocate attention between contextual and parametric knowledge. Understanding and steering this internal behavior is key both for trustworthy deployment and for scientific interpretability of model mechanisms. We introduce COMPASS (Context-Modulated PID Attention Steering System), a lightweight, interpretable control framework that embeds a model-based feedback loop directly within decoding. COMPASS quantifies context reliance via a transparent metric, the Context Reliance Score (CRS), which serves as an online probe of how attention heads ground generation in evidence. Using this interpretable signal, a PID controller dynamically modulates attention heads to maintain factual consistency without retraining or multi-pass decoding. Across benchmarks (HotpotQA, XSum, HaluEval, RAGTruth), COMPASS consistently reduces contextual hallucination rates (2.8–5.8% absolute) while revealing how distinct attention heads contribute to evidence alignment. These results highlight feedback-driven interpretability as a pathway toward scientific understanding of LLM behavior.
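For readers unfamiliar with PID control, the following is a minimal discrete PID controller. In a COMPASS-like setup the measurement would presumably be the Context Reliance Score at each decoding step and the output a scaling signal for attention heads, but how COMPASS computes CRS or applies the control signal is not specified in the abstract, so the gains, setpoint, and usage below are purely illustrative.

```python
class PIDController:
    """Textbook discrete PID: output = Kp*e + Ki*sum(e*dt) + Kd*de/dt."""

    def __init__(self, kp, ki, kd, setpoint):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint
        self.integral = 0.0
        self.prev_error = None  # no derivative term on the first step

    def update(self, measurement, dt=1.0):
        error = self.setpoint - measurement
        self.integral += error * dt
        if self.prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


# Hypothetical usage: push a measured score toward a target of 1.0.
pid = PIDController(kp=1.0, ki=0.5, kd=0.25, setpoint=1.0)
for score in (0.0, 0.5):
    print(pid.update(score))
```

The proportional term reacts to the current gap, the integral term to accumulated drift, and the derivative term damps rapid changes, which is what allows a single feedback loop to run inside decoding without retraining.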
