Singapore

Large language models (LLMs) are increasingly deployed worldwide, yet their safety alignment remains predominantly English-centric. This leaves vulnerabilities in non-English contexts, especially in low-resource languages. We introduce a novel application of knowledge distillation (KD) to multilingual jailbreak prevention and examine its efficacy. We distill the refusal behaviors of a proprietary teacher model ($\texttt{OpenAI o1-mini}$) with Low-Rank Adaptation (LoRA) into three open-source student models: $\texttt{Meta-Llama-3-8B-Instruct}$, $\texttt{Gemma-2-2B-IT}$, and $\texttt{Qwen3-8B}$, using ~28,000 multilingual jailbreak prompts from $\texttt{XSafety}$ via response-based, parameter-efficient fine-tuning (PEFT). Evaluation on the $\texttt{MultiJail}$ benchmark reveals a counterintuitive behavior: fine-tuning on the teacher's ``safe'' refusal data inadvertently increases Jailbreak Success Rate (JSR) for all student models, by up to 16.6 percentage points. Our experiments reveal divergent generalization to unseen languages during distillation, with outcomes varying by base model. Overall, our exploratory study highlights the challenges and potential of KD as a technique for multilingual safety alignment, offering a foundation for future research in this direction.

AAAI 2026

Response-Based Knowledge Distillation for Multilingual Jailbreak Prevention Unwittingly Compromises Safety


workshop paper
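As a reading aid for the headline metric in the abstract above, the following is a minimal sketch of how a Jailbreak Success Rate and its change in percentage points could be computed. The label strings, the sample data, and the function name `jailbreak_success_rate` are illustrative assumptions, not the authors' evaluation code.

```python
def jailbreak_success_rate(labels):
    """Percentage of judged responses labeled "unsafe" (assumed label scheme)."""
    if not labels:
        raise ValueError("need at least one judged response")
    unsafe = sum(1 for label in labels if label == "unsafe")
    return 100.0 * unsafe / len(labels)


# Hypothetical judge labels for the same 8 prompts, before and after fine-tuning.
base_labels = ["safe", "safe", "unsafe", "safe", "safe", "safe", "safe", "safe"]
tuned_labels = ["safe", "unsafe", "unsafe", "safe", "safe", "unsafe", "safe", "safe"]

base_jsr = jailbreak_success_rate(base_labels)
tuned_jsr = jailbreak_success_rate(tuned_labels)

# A difference of two rates is reported in percentage points, not percent.
delta_pp = tuned_jsr - base_jsr
print(f"base={base_jsr:.1f}%  tuned={tuned_jsr:.1f}%  delta={delta_pp:+.1f} pp")
```

A positive `delta_pp` means the fine-tuned model is jailbroken more often than the base model, which is the direction of change the abstract reports.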

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held at Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.


Large Language Models (LLMs) exhibit impressive generative capabilities but remain vulnerable to adversarial inputs, exposing risks such as data leakage, harmful content generation, and jailbreak attacks. Jailbreak attacks can fool LLMs and elicit harmful output despite safety alignment and guardrails. In this work, we perform a case study using AutoDAN, an automated adversarial attack generator, to stress-test open-source LLMs. We measure baseline attack success rates (ASRs) and then apply fine-tuning defenses to mitigate these vulnerabilities. We use AutoDAN to construct adversarial datasets and evaluate the robustness of fine-tuned LLMs. Using llama-3-8b-instruct as the base model, we apply full Supervised Fine-Tuning (SFT) on AutoDAN-style attacks. Our preliminary experiments show a reduction in jailbreak success rate after fine-tuning, while maintaining usefulness and coherence for benign queries. We conclude by outlining best practices for deploying adversarially resilient LLMs in production environments and future work on adversarial attack vulnerabilities and agentic workflows.

Adversarial Testing for Large Language Models: Evaluating and Enhancing Robustness with AutoDAN and Fine-Tuning Techniques

The performance of edge language models is fundamentally determined by the quality of their training data. To address the challenge of efficient data curation in resource-constrained environments, this study adapts and optimizes the Predictive Data Selection (Preselect) methodology. Our approach focuses on enhancing two core capabilities crucial for edge AI applications: ChatRAG, as the foundation for knowledge interaction, and Function Calling, as the basis for tool use.

By designing an evaluation ensemble that includes specialized models and training a FastText lightweight classifier, we can efficiently filter high-value training samples from massive datasets. Experimental results demonstrate that this strategy yields significant performance improvements, particularly in ChatRAG (+10.5%) and Function Calling (+10.0%). This research validates that an edge-optimized Preselect is an effective and viable strategy for enhancing targeted capabilities in edge models, ultimately proving that under resource constraints, curated data quality is a more critical driver of performance than mere data quantity.

Quality Over Quantity: Predictive Data Selection for Edge Language Models

Radiology worklists are typically processed first-in, first-out (FIFO) even when studies differ greatly in clinical urgency. We propose a pragmatic alternative: using calibrated probabilities of intracranial hemorrhage (ICH) to prioritize head CT exams for earlier reading. Using the public RSNA-ICH dataset, we train slice-level detectors, aggregate to exam-level, apply post-hoc calibration, and feed these scores into a transparent discrete-event simulator of the reading queue. The simulator quantifies how triage benefit, measured as the reduction in median time-to-read (TTR) for ICH, scales with classifier AUC, workload (arrival rate), staffing, prevalence, and calibration. Across realistic loads, score-based prioritization yields substantial TTR reductions for ICH with minimal delay to non-ICH studies. We release a configuration-driven, reproducible pipeline that translates AI risk scores into operational metrics (minutes saved), enabling safe and data-driven evaluation before PACS/RIS deployment.

Translating Classifier Scores into Clinical Impact: Calibrated Risk and Queueing Simulation for AI-Assisted Radiology Worklist Triage
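The queueing idea in the abstract above can be illustrated with a minimal single-reader discrete-event sketch that compares FIFO against score-based prioritization. Everything here is an assumption made for illustration: the fixed reading time, the Beta-distributed synthetic scores, the Poisson arrivals, and the names `simulate`, `make_exams`, and `median_wait`. It is not the released pipeline.

```python
import heapq
import random
from statistics import median


def simulate(exams, read_minutes, prioritize):
    """Single-reader queue over exams = [(arrival_min, score, is_ich), ...],
    sorted by arrival. Returns [(wait_minutes, is_ich), ...] in reading order."""
    waiting = []  # min-heap keyed by arrival (FIFO) or by -score (priority)
    results = []
    clock = 0.0
    i, n = 0, len(exams)
    while i < n or waiting:
        # Admit every exam that has arrived by the current clock time.
        while i < n and exams[i][0] <= clock:
            arrival, score, is_ich = exams[i]
            key = -score if prioritize else arrival
            heapq.heappush(waiting, (key, arrival, is_ich))
            i += 1
        if not waiting:
            clock = exams[i][0]  # reader is idle: jump to the next arrival
            continue
        _, arrival, is_ich = heapq.heappop(waiting)
        results.append((clock - arrival, is_ich))  # time until reading starts
        clock += read_minutes
    return results


def make_exams(n, mean_gap, prevalence, seed):
    """Synthetic worklist: exponential inter-arrival gaps, hypothetical scores."""
    rng = random.Random(seed)
    t, exams = 0.0, []
    for _ in range(n):
        t += rng.expovariate(1.0 / mean_gap)
        is_ich = rng.random() < prevalence
        # Assumed score model: ICH exams score higher on average.
        score = rng.betavariate(5, 2) if is_ich else rng.betavariate(2, 5)
        exams.append((t, score, is_ich))
    return exams


def median_wait(results, ich):
    return median(w for w, is_ich in results if is_ich == ich)


exams = make_exams(n=2000, mean_gap=5.0, prevalence=0.1, seed=0)
for name, flag in (("FIFO", False), ("priority", True)):
    res = simulate(exams, read_minutes=4.5, prioritize=flag)
    print(name, "median ICH wait:", round(median_wait(res, True), 1),
          "median non-ICH wait:", round(median_wait(res, False), 1))
```

Varying `mean_gap`, `prevalence`, or the separation between the two score distributions mimics the workload, prevalence, and AUC sweeps the abstract describes.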

Meta-learning enables models to rapidly adapt to new tasks by leveraging prior experience, but its adaptation mechanisms remain opaque, especially regarding how past training tasks influence future predictions. We introduce TLXML (Task-Level eXplanation of Meta-Learning), a novel framework that extends influence functions to meta-learning settings, enabling task-level explanations of adaptation and inference. By reformulating influence functions for bi-level optimization, TLXML quantifies the contribution of each meta-training task to the adapted model's behaviour. To ensure scalability, we propose a Gauss-Newton-based approximation that significantly reduces computational complexity from $O(pq^2)$ to $O(pq)$, where $p$ and $q$ denote model and meta parameters, respectively. Empirical results demonstrate that TLXML effectively ranks training tasks by their influence on downstream performance, offering concise and intuitive explanations aligned with user-level abstraction. This work provides a critical step toward interpretable and trustworthy meta-learning systems.

A Task-Level Explanation Framework for Meta-Learning Algorithms

As Reinforcement Learning (RL) agents are increasingly deployed in real-world applications, ensuring their behavior is transparent and trustworthy is paramount. A key component of trust is explainability, yet much of the work in Explainable RL (XRL) focuses on local, single-step decisions. This paper addresses the critical need for explaining an agent's long-term behavior through trajectory-level analysis. We introduce a novel framework that ranks entire trajectories by defining and aggregating a new state-importance metric. This metric combines the classic Q-value difference with a "radical term" that captures the agent's affinity to reach its goal, providing a more nuanced measure of state criticality. We demonstrate that our method successfully identifies optimal trajectories from a heterogeneous collection of agent experiences. Furthermore, by generating counterfactual rollouts from critical states within these trajectories, we show that the agent's chosen path is robustly superior to alternatives, thereby providing a powerful "Why this, and not that?" explanation. Our experiments in standard OpenAI Gym environments validate that our proposed importance metric is more effective at identifying optimal behaviors compared to classic approaches, offering a significant step towards trustworthy autonomous systems.

Know your Trajectory - Trustworthy Reinforcement Learning deployment through Importance-Based Trajectory Analysis
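The "classic Q-value difference" mentioned in the abstract above is commonly taken to be the gap between the best and worst action values in a state. The sketch below ranks trajectories by that baseline alone. The abstract does not define the additional "radical term", so it is omitted here, and the mean aggregation, the toy Q-table, and the function names are assumptions for illustration only.

```python
def state_importance(q_values):
    """Classic importance: gap between the best and worst action value."""
    return max(q_values) - min(q_values)


def trajectory_score(trajectory, q_table):
    """Assumed aggregation: mean state importance along the trajectory."""
    return sum(state_importance(q_table[s]) for s in trajectory) / len(trajectory)


# Hypothetical Q-table: state -> list of action values.
q_table = {
    "s0": [1.0, 1.0],  # action choice does not matter here
    "s1": [0.0, 4.0],
    "s2": [2.0, 3.0],
    "s3": [5.0, 0.0],  # action choice matters a lot here
}

# Hypothetical trajectories given as sequences of visited states.
trajectories = {"A": ["s0", "s2"], "B": ["s1", "s3"], "C": ["s0", "s1"]}

ranked = sorted(
    trajectories,
    key=lambda name: trajectory_score(trajectories[name], q_table),
    reverse=True,
)
print(ranked)
```

Trajectories that pass through states where the action choice matters most rise to the top of the ranking under this baseline.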

This paper introduces a novel AI-driven approach for extracting actionable insights from corporate communications by quantifying strategic ambiguity in language. While prior work in natural language analysis has largely focused on sentiment or factual content, we explore how organizations deliberately hedge, obscure, or soften information, using linguistic ambiguity as a rich signal of intent and hidden meaning. We propose the Strategic Ambiguity Score (SAS), a machine learning model that captures deliberate vagueness by integrating hedge frequency, negation patterns, and model-based attention to critical phrases. Unlike traditional sentiment models, SAS measures not just what is said, but how and where uncertainty is strategically embedded within the text. We demonstrate that SAS can effectively highlight subtle signals that correlate with subsequent outcomes, and we illustrate its utility through predictive analyses in corporate disclosures. By shifting the focus from simple sentiment interpretation to ambiguity detection, this work provides a generalizable framework for AI applications in decision-making, risk assessment, and strategic communication analysis across diverse domains.

Quantifying Strategic Ambiguity in Corporate Language for AI-Driven Trading Strategies

The rapid integration of AI into education has prioritized capability over trustworthiness, creating significant risks. Real-world deployments reveal that even advanced models are insufficient without extensive architectural scaffolding to ensure reliability. Current evaluation frameworks are fragmented: institutional policies lack technical verification, pedagogical guidelines assume AI reliability, and technical metrics are context-agnostic. This leaves institutions without a unified standard for deployment readiness. This paper introduces TEAS (Trusted Educational AI Standard), an integrated framework built on four interdependent pillars: (1) Verifiability, grounding content in authoritative sources; (2) Stability, ensuring deterministic core knowledge; (3) Auditability, enabling independent institutional validation; and (4) Pedagogical Soundness, enforcing principles of active learning. We argue that trustworthiness stems primarily from systematic architecture, not raw model capability. This insight implies that affordable, open-source models can achieve deployment-grade trust, offering a scalable and equitable path to integrating AI safely into learning environments globally.

TEAS: Trusted Educational AI Standard: A Framework for Verifiable, Stable, Auditable, and Pedagogically Sound Learning Systems

Large Language Models (LLMs) adapted through Low-Rank Adaptation (LoRA) often exhibit weakened safety alignment, even when fine-tuned on benign datasets. Such degradation poses significant risks for deployable AI systems, where parameter updates can unintentionally introduce unsafe or unstable behaviors. In this work, we propose Directional Deviation Index Guided Pruning (DDI Pruning), a post-hoc and data-free framework for diagnosing and mitigating unsafe LoRA adaptations. DDI quantifies the spectral and directional deviation of each LoRA-updated layer relative to its pretrained baseline, identifying layers that contribute most to instability or misalignment. Layers with high DDI scores are selectively pruned, improving both model robustness and computational efficiency without additional training or supervision. We evaluate the proposed approach on multiple language generation and agent planning benchmarks using several LLM backbones. Results show that DDI Pruning consistently reduces harmful or adversarial behaviors while preserving task accuracy and coherence. Ablation studies further demonstrate that each component of DDI contributes to capturing unsafe adaptation patterns, highlighting its interpretability and generality across domains. Overall, DDI Pruning provides an effective and practical mechanism for enhancing the safety alignment of adapted LLMs and contributes to the development of reliable and deployable AI systems.

Safe and Deployable LLM Adaptation: Directional Deviation Index–Guided Model Pruning

The paper explores how video models trained for classification tasks represent nuanced, hidden semantic information that may not affect the final outcome, a key challenge for Trustworthy AI models. Through Explainable and Interpretable AI methods, specifically mechanistic interpretability techniques, the internal circuit responsible for representing the action's outcome is reverse-engineered in a pre-trained video vision transformer, revealing that the "Success vs Failure" signal is computed through a distinct amplification cascade. While there are low-level differences observed from layer 0, the abstract and semantic representation of the outcome is progressively amplified from layers 5 through 11. Causal analysis, primarily using activation patching supported by ablation results, reveals a clear division of labor: Attention Heads act as "evidence gatherers", providing necessary low-level information for partial signal recovery, while MLP Blocks function as robust "concept composers", each of which is sufficient to generate the entire "success" signal. This distributed and redundant circuit in the model's internals explains its resilience to simple ablations, demonstrating a core computational pattern for processing human-action outcomes. Crucially, the existence of this sophisticated circuit for representing complex outcomes, even within a model trained only for simple classification, highlights the potential for models to develop forms of 'hidden knowledge' beyond their explicit task, underscoring the need for mechanistic oversight for building genuinely Explainable and Trustworthy AI systems intended for deployment.

Attention Gathers, MLPs Compose: A Causal Analysis of an Action-Outcome Circuit in VideoViT

Semantic representations of rhythmic structures are important for AI-driven music generation and choreography. South Asian classical dance, such as Bharatanatyam, relies on intricate rhythms that guide choreography and improvisation. These rhythms are expressed through Nattuvangam, a vocal and percussive form that uses rhythmic syllables (Solkattus) and cymbal cues (Talam). Despite its pedagogical importance, Nattuvangam is rarely documented in digital form, which limits systematic study and teaching. We present the first curated dataset of Nattuvangam recordings that capture diverse Solkattu patterns and cyclic Talam structures. Each clip is analyzed using handcrafted and learned features, including onset envelopes, inter-onset intervals, tempograms, and Mel-spectrogram embeddings. These representations allow machine learning models to identify, cluster, and retrieve rhythmic motifs across performances. The dataset serves as a pedagogical tool and supports computational exploration of Solkattu patterns in relation to Talam, revealing the structural principles underlying Nattuvangam. This work establishes a foundation for studying Nattuvangam as both a standalone and performative art form, bridging cultural teaching with AI-based rhythm analysis in low-resource contexts.
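Of the features listed in the abstract above, inter-onset intervals are the simplest to illustrate. The sketch below assumes onset times (in seconds) have already been detected by some upstream step, and it treats each onset as one beat when estimating tempo. Both are simplifying assumptions, and the function names are hypothetical rather than taken from the dataset's tooling.

```python
from statistics import median


def inter_onset_intervals(onset_times):
    """Gaps in seconds between consecutive onsets."""
    return [later - earlier for earlier, later in zip(onset_times, onset_times[1:])]


def tempo_bpm(onset_times):
    """Rough tempo estimate, assuming one onset per beat."""
    iois = inter_onset_intervals(onset_times)
    if not iois:
        raise ValueError("need at least two onsets")
    return 60.0 / median(iois)


# Hypothetical onsets for a steady pattern struck every half second.
onsets = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
print(inter_onset_intervals(onsets), tempo_bpm(onsets))
```

Using the median rather than the mean keeps the estimate stable when a few onsets are missed or spuriously detected.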
