Singapore


AAAI 2026

SOM Directions Are Better than One: Multi-Directional Refusal Suppression in Language Models

NLP: Ethics -- Bias, Fairness, Transparency & Privacy

NLP: Safety and Robustness

NLP: (Large) Language Models

NLP: Interpretability, Analysis, and Evaluation of NLP Models

Refusal refers to the functional behavior enabling safety-aligned language models to reject harmful or unethical prompts. Following the growing scientific interest in mechanistic interpretability, recent work encoded refusal behavior as a single direction in the model’s latent space; e.g., computed as the difference between the centroids of harmful and harmless prompt representations. However, emerging evidence suggests that concepts in LLMs often appear to be encoded as a low-dimensional manifold embedded in the high-dimensional latent space. Motivated by these findings, we propose a novel method leveraging Self-Organizing Maps (SOMs) to extract multiple refusal directions. To this end, we first prove that SOMs generalize the prior work's difference-in-means technique. We then train SOMs on harmful prompt representations to identify multiple neurons. By subtracting the centroid of harmless representations from each neuron, we derive a set of multiple directions expressing the refusal concept. We validate our method on an extensive experimental setup, demonstrating that ablating multiple directions from models' internals outperforms not only the single-direction baseline but also specialized jailbreak algorithms, leading to an effective suppression of refusal. Finally, we conclude by analyzing the mechanistic implications of our approach.
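The pipeline in the abstract (SOM prototypes on harmful representations, minus the harmless centroid, then ablation) can be illustrated with a minimal, standard-library sketch. It is not the authors' implementation: the 1-D SOM topology, the decay schedules, the function names, and the sequential projection used for ablation are all assumptions made for illustration, and the representations are plain lists of floats.

```python
import math
import random

def mean(vectors):
    """Centroid of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def difference_in_means(harmful, harmless):
    """Single-direction baseline: harmful centroid minus harmless centroid."""
    return normalize(sub(mean(harmful), mean(harmless)))

def train_som(data, n_neurons, epochs=50, lr0=0.5, sigma0=1.0, seed=0):
    """Tiny 1-D SOM (assumed topology): neurons sit on a line and
    neighbours of the best-matching unit share each update."""
    rng = random.Random(seed)
    neurons = [list(rng.choice(data)) for _ in range(n_neurons)]
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                   # linearly decaying step size
        sigma = max(sigma0 * (1.0 - frac), 1e-3)  # shrinking neighbourhood
        for x in rng.sample(data, len(data)):
            # Best-matching unit: the neuron closest to the sample.
            bmu = min(range(n_neurons),
                      key=lambda j: sum((w - xi) ** 2
                                        for w, xi in zip(neurons[j], x)))
            for j in range(n_neurons):
                h = math.exp(-((j - bmu) ** 2) / (2 * sigma ** 2))
                neurons[j] = [w + lr * h * (xi - w)
                              for w, xi in zip(neurons[j], x)]
    return neurons

def som_directions(harmful, harmless, n_neurons):
    """One unit direction per SOM neuron: neuron minus harmless centroid."""
    mu_harmless = mean(harmless)
    return [normalize(sub(w, mu_harmless))
            for w in train_som(harmful, n_neurons)]

def ablate(h, directions):
    """Project each direction out of h in turn. The directions are not
    orthogonalised here, so only the last one is removed exactly."""
    for d in directions:
        coeff = dot(h, d)
        h = [a - coeff * b for a, b in zip(h, d)]
    return h
```

With a single neuron the prototype tracks the harmful centroid, so `som_directions(..., n_neurons=1)` approximates `difference_in_means`; this is only an informal illustration of the generalization the abstract says is proved.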

poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held at Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.


As backdoor attacks become more stealthy and robust, they reveal critical weaknesses in current defense strategies: detection methods often rely on coarse-grained feature statistics, and purification methods typically require full retraining or additional clean models. To address these challenges, we propose DUP (Detection-guided Unlearning for Purification), a unified framework that integrates backdoor detection with unlearning-based purification. The detector captures feature-level anomalies by jointly leveraging class-agnostic distances and inter-layer transitions. These deviations are integrated through a weighted scheme to identify poisoned inputs, enabling more fine-grained analysis. Based on the detection results, we purify the model through a parameter-efficient unlearning mechanism that avoids full retraining and does not require any external clean model. Specifically, we innovatively repurpose knowledge distillation to guide the student model toward increasing its output divergence from the teacher on detected poisoned samples, effectively forcing it to unlearn the backdoor behavior. Extensive experiments across diverse attack methods and language model architectures demonstrate that DUP achieves superior defense performance in detection accuracy and purification efficacy.

DUP: Detection-guided Unlearning for Backdoor Purification in Language Models

Adapting large language models (LLMs) to new languages is an expensive and opaque process. Understanding how language models acquire new languages and multilingual abilities is key to achieving efficient adaptation. Prior multilingual interpretability research focuses primarily on how trained models process multilingual instructions, leaving unexplored the mechanisms through which they acquire new languages during training. We investigate these training dynamics on decoder-only transformers through the lens of two functional cognitive specializations: language perception (input comprehension) and production (output generation). Through experiments on low-resource languages, we demonstrate how perceptual and productive specialization emerges in different regions of a language model by running layer ablation sweeps from the model's input and output directions. Based on the observed specialization patterns, we propose CogSym, a layer-wise heuristic that enables effective adaptation by exclusively fine-tuning a few early and late layers. We show that tuning only the 25% outermost layers achieves downstream task performance within 2-3% of the full fine-tuning baseline. CogSym yields consistent performance with adapter methods such as LoRA, showcasing generalization beyond full fine-tuning. These findings provide insights to better understand how LLMs learn new languages and push toward accessible and inclusive language modeling.

Positional Cognitive Specialization: Where Do LLMs Learn to Comprehend and Speak Your Language?
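The "25% outermost layers" heuristic from the abstract above can be sketched as a simple index selection. The function name and the even split between early and late layers are assumptions; the abstract does not state how the budget is divided.

```python
def outermost_layers(n_layers, fraction=0.25):
    """Indices of the layers to fine-tune: roughly `fraction` of the stack,
    split between the input side and the output side (assumed even split,
    with any odd layer going to the output side)."""
    k = min(n_layers, max(2, round(n_layers * fraction)))
    n_early = k // 2
    n_late = k - n_early
    return list(range(n_early)) + list(range(n_layers - n_late, n_layers))
```

In a training script, every layer whose index is not returned would be frozen before fine-tuning starts.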

The mapping from sound to neural activity that underlies hearing is highly non-linear. The first few stages of this mapping in the cochlea have been modelled successfully, initially with biophysical models built by hand and, more recently, with DNN models trained on datasets simulated by the biophysical models. Modelling the auditory brain has been a challenge because central auditory processing is too complex for models to be built by hand, and datasets for training DNN models directly have not been available. Recent work has taken advantage of large-scale, high-resolution neural recordings from the auditory midbrain to build a DNN model of normal hearing with great success. But this model assumes that auditory processing is the same in all brains, and therefore it cannot capture the widely varying effects of hearing loss.

We propose a novel variational-conditional model to learn to encode the space of hearing loss directly from recordings of neural activity in the auditory midbrain of healthy and noise-exposed animals. With hearing loss parametrised by only 6 free parameters per animal, our model accurately predicts 62% of the explainable variance in neural responses from normal-hearing animals and 68% for hearing-impaired animals, comparable to state-of-the-art animal-specific models. We demonstrate that the model can be used to simulate realistic activity from out-of-sample animals by fitting only the learned conditioning parameters with Bayesian optimisation, achieving cross-entropy loss within 2% of the optimum in 15-30 iterations. Including more animals in the training data slightly improved the performance on unseen animals. This model will enable future development of parametrised hearing loss compensation models trained to directly restore normal neural coding in hearing-impaired brains, which can be quickly fitted for a new user by human-in-the-loop optimisation.

Modelling the Effects of Hearing Loss on Neural Coding in the Auditory Midbrain with Variational Conditioning

The surface pressure field of transportation systems, including cars, trains, and aircraft, is critical for aerodynamic analysis and design. In recent years, deep neural networks have emerged as promising and efficient methods for modeling surface pressure fields, offering an alternative to computationally expensive CFD simulations. Currently, large-scale public datasets are available for domains such as automotive aerodynamics. However, in many specialized areas, such as high-speed trains, data scarcity remains a fundamental challenge in aerodynamic modeling, severely limiting the effectiveness of standard neural network approaches. To address this limitation, we propose the Adaptive Field Learning Framework (AdaField), which pre-trains the model on public large-scale datasets to improve generalization in sub-domains with limited data. AdaField comprises two key components. First, we design the Semantic Aggregation Point Transformer (SAPT) as a high-performance backbone that efficiently handles large-scale point clouds for surface pressure prediction. Second, to account for the substantial differences in flow conditions and geometric scales across aerodynamic sub-domains, we propose the Flow-Conditioned Adapter (FCA) and Physics-Informed Data Augmentation (PIDA). FCA enables the model to flexibly adapt to different flow conditions with a small set of trainable parameters, while PIDA expands the training data distribution to better cover variations in object scale and velocity. Our experiments show that AdaField achieves SOTA performance on the DrivAerNet++ dataset and can be effectively transferred to train and aircraft scenarios with minimal fine-tuning. These results highlight AdaField’s potential as a generalizable and transferable solution for surface pressure field modeling, supporting efficient aerodynamic design across a wide range of transportation systems.

AdaField: Generalizable Surface Pressure Modeling with Physics-Informed Pre-training and Flow-Conditioned Adaptation

3D Vision-Language Foundation Models (VLFMs) have shown strong generalization and zero-shot recognition capabilities in open-world point cloud processing tasks. However, these models often underperform in practical scenarios where data are noisy, incomplete, or drawn from a different distribution than the training data.
To address this, we propose **Uni-Adapter**, a novel training-free online test-time adaptation (TTA) strategy for 3D VLFMs based on dynamic prototype learning. We define a 3D cache to store class-specific cluster centers as prototypes, which are continuously updated to capture intra-class variability in heterogeneous data distributions. These dynamic prototypes serve as anchors for cache-based logit computation via similarity scoring. Simultaneously, a graph-based label smoothing module captures inter-prototype similarities to enforce label consistency among similar prototypes. Finally, we unify predictions from the original 3D VLFM and the refined 3D cache using entropy-weighted aggregation for reliable adaptation. Without retraining, Uni-Adapter effectively mitigates distribution shifts, achieving state-of-the-art performance on diverse 3D benchmarks over different 3D VLFMs, improving ModelNet-40C by **10.55%**, ScanObjectNN-C by **8.26%**, and ShapeNet-C by **4.49%** over the source 3D VLFMs.

Adapt-As-You-Walk Through the Clouds: Training-Free Online Test-Time Adaptation of 3D Vision-Language Foundation Models
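The entropy-weighted aggregation mentioned in the abstract above can be illustrated with a small sketch. The weighting rule used here (each source weighted by `exp(-entropy)` of its own prediction) and the function names are assumptions, since the abstract does not give the formula; the idea shown is only that the more confident of the two predictions contributes more.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def entropy_weighted_fusion(model_logits, cache_logits):
    """Blend two class distributions, favouring the lower-entropy one.
    Assumed rule: weight_i is proportional to exp(-H(p_i))."""
    p_model = softmax(model_logits)
    p_cache = softmax(cache_logits)
    w_model = math.exp(-entropy(p_model))
    w_cache = math.exp(-entropy(p_cache))
    z = w_model + w_cache
    return [(w_model * a + w_cache * b) / z
            for a, b in zip(p_model, p_cache)]
```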

For a class of hybrid dynamical systems, we show that a recurrent neural network with hybrid dynamics, which we refer to as a hybrid dynamic recurrent neural network (HyRNN), can be constructed to approximate solutions to hybrid systems over bounded (hybrid) time horizons. Specifically, given a desired precision level, we show that a hybrid system with dynamics resembling those of recurrent neural networks for continuous-time and discrete-time systems can be designed so that, for each bounded hybrid time horizon, its solutions are close to the solutions to the given hybrid system. Through the use of universal approximation theorems, we show that the approximation result holds for traditional smooth activation functions, such as sigmoid and arctan, and that extensions to ReLU functions are possible, and characterize the complexity of the proposed HyRNN.

HyRNN: Hybrid Recurrent Neural Networks for Approximating Hybrid Dynamical Systems

Multimodal Large Language Models (MLLMs) have shown advanced performance in vision-language tasks. However, existing multimodal reasoning models often suffer from excessive reasoning steps, leading to high computational costs and inefficiency. In this paper, we propose the Multimodal Adaptive Reasoning Model (MARS), which enables adaptive adjustment of the reasoning strategy based on question difficulty. Specifically, MARS adopts a three-stage training framework based on our constructed training dataset (MART): 1) CoT Masking Learning to enhance reasoning logicality by predicting masked reasoning steps. 2) Adaptive Reasoning Instruction Learning to train the model to skip or keep reasoning steps according to difficulty levels. 3) CoT Lightweight Reinforcement Learning with the Information Bottleneck Principle based GRPO algorithm to reduce CoT length while maintaining performance and generalizability. Results on both in-domain and out-of-domain datasets show that MARS significantly reduces the CoT length (90.2% decrease) while improving accuracy (0.54%), outperforming existing SOTA open-source and proprietary MLLMs.

MARS: Multimodal Adaptive Reasoning Model for Avoiding Overthinking

We introduce **LLaMMo** (**L**arge **La**nguage and **M**ulti-Person **Mo**tion Assistant), the first instruction-tuning multimodal framework tailored for multi-human motion analysis. LLaMMo incorporates a novel human-centric and social-temporal learner that models and fuses both intra-person dynamics and inter-person dependencies, yielding robust, context-aware representations of complex group behaviors while maintaining low computational overhead. To support LLaMMo, we construct **LLaVerse**, a large-scale dataset with fine-grained manual annotations covering diverse multi-person activities spanning daily social interaction and professional team sports. Built on top of LLaVerse, we also propose **LLaMI-Bench**, a dedicated benchmark for evaluating multi-human behavior understanding across motion and video modalities. Extensive experiments demonstrate that LLaMMo consistently outperforms baselines in understanding multi-person interactions under low-latency settings, with notable gains in both social and sport-specific contexts.

Multiple Human Motion Understanding

Predicting pedestrian motion trajectories is critical for the path planning and motion control of autonomous vehicles. Recent diffusion-based models have shown promising results in capturing the inherent stochasticity of pedestrian behavior for trajectory prediction. However, the absence of explicit semantic modelling of pedestrian intent in many diffusion-based methods may result in misinterpreted behaviors and reduced prediction accuracy. To address the above challenges, we propose a diffusion-based pedestrian trajectory prediction framework that incorporates both short-term and long-term motion intentions. Short-term intent is modelled using a residual polar representation, which decouples direction and magnitude to capture fine-grained local motion patterns. Long-term intent is estimated through a learnable, token-based endpoint predictor that generates multiple candidate goals with associated probabilities, enabling multimodal and context-aware intention modelling. Furthermore, we enhance the diffusion process by incorporating adaptive guidance and a residual noise predictor that dynamically refines denoising accuracy. The proposed framework is evaluated on the widely used ETH, UCY, and SDD benchmarks, demonstrating competitive results against state-of-the-art methods. Code is available at https://anonymous.4open.science/r/IAD/.

Intention-Aware Diffusion Model for Pedestrian Trajectory Prediction
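The residual polar representation described in the abstract above decouples direction from magnitude. A minimal sketch, assuming "residual" means the displacement between consecutive 2-D positions (the abstract does not define it) and using hypothetical function names:

```python
import math

def to_residual_polar(track):
    """Convert a list of (x, y) positions into per-step (magnitude, angle)
    pairs computed from consecutive displacements."""
    steps = []
    for (x0, y0), (x1, y1) in zip(track, track[1:]):
        dx, dy = x1 - x0, y1 - y0
        steps.append((math.hypot(dx, dy), math.atan2(dy, dx)))
    return steps

def from_residual_polar(start, steps):
    """Inverse mapping: accumulate polar steps from a start position."""
    x, y = start
    track = [(x, y)]
    for r, theta in steps:
        x += r * math.cos(theta)
        y += r * math.sin(theta)
        track.append((x, y))
    return track
```

Keeping step length and heading as separate channels lets a model treat speed changes and turns independently, which is the decoupling the abstract refers to.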

Multi-Domain Multi-Task (MDMT) recommendation aims to provide personalized recommendations by leveraging information across multiple domains and tasks. However, existing methods often suffer from spurious correlations between irrelevant features and the target, leading to negative transfer. To address this, we propose a Stable and Adaptive Fusion (SAF) framework for MDMT recommendation. SAF introduces a weighted Hilbert-Schmidt Independence Criterion (HSIC) loss to decorrelate irrelevant features from the target, learning sample weights that promote stable (i.e., robust to spurious correlations) representations in both bottom and expert layers. We employ Random Fourier Features (RFF) to enable scalable computation of the HSIC loss. We further employ adaptive feature and expert gating to select these stable features, enabling the model to capture intricate cross-domain and cross-task dependencies. The learned sample weights are also used to reweight the MDMT loss during training. Experiments on large-scale datasets show that SAF outperforms state-of-the-art baselines by up to 2% in AUC. To facilitate further research, we release a new industrial dataset with 30 million interactions across 3 domains and 2 tasks, with 300 features.
