Singapore

Multimodal Large Language Models (MLLMs) are susceptible to the implicit reasoning risk, wherein innocuous unimodal inputs synergistically assemble into risky multimodal data that produce harmful outputs. We attribute this vulnerability to the difficulty of MLLMs maintaining safety alignment through long-chain reasoning. To address this issue, we introduce Safe-Semantics-but-Unsafe-Interpretation (SSUI), the first dataset featuring interpretable reasoning paths tailored for such a cross-modal challenge. A novel training framework, Safety-aware Reasoning Path Optimization (SRPO), is also designed based on the SSUI dataset to align the MLLM's internal reasoning process with human safety values. Experimental results show that our SRPO-trained models achieve state-of-the-art results on key safety benchmarks, including the proposed Reasoning Path Benchmark (RSBench), significantly outperforming both open-source and top-tier commercial MLLMs.

AAAI 2026

When Safe Unimodal Inputs Collide: Optimizing Reasoning Chains for Cross-Modal Safety in Multimodal Large Language Models

peai


poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held at Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.



Parameter-efficient transfer learning (PETL) has emerged as a pivotal paradigm for adapting pre-trained foundation models to downstream tasks. It significantly reduces the number of trainable parameters, yet still suffers from substantial memory overhead caused by gradient backpropagation during fine-tuning. Memory-efficient transfer learning (METL) circumvents this challenge by bypassing backbone gradient computation via lightweight side networks, but its stringent memory constraint severely limits the learning capacity of the side networks, thereby significantly compromising performance. To address these limitations, we propose a novel Mixed-Precision Interactive Side Mixture-of-Experts framework (MP-ISMoE). Specifically, we first propose a Gaussian Noise Perturbed Iterative Quantization (GNP-IQ) scheme to quantize weights to lower bit-widths while effectively decreasing quantization errors. By leveraging the memory conserved by GNP-IQ, we subsequently employ Interactive Side Mixture-of-Experts (ISMoE) to scale up side networks without sacrificing overall memory efficiency. Different from conventional mixture-of-experts, ISMoE learns to select optimal experts by interacting with salient features from frozen backbones, thus suppressing knowledge forgetting and boosting performance. Extensive experiments across diverse vision-language and language-only tasks demonstrate that MP-ISMoE markedly improves accuracy compared to state-of-the-art METL approaches, while maintaining comparable parameter and memory efficiency. The source code will be publicly available upon acceptance.

MP-ISMoE: Mixed-Precision Interactive Side Mixture-of-Experts for Efficient Transfer Learning
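The MP-ISMoE abstract above relies on quantizing weights to fewer bits while keeping quantization error small. As a point of reference only, the following is a minimal sketch of plain symmetric uniform quantization, which shows how the reconstruction error grows as the bit-width shrinks. It is not the GNP-IQ scheme (the abstract does not specify its steps); the function names and the example weights are assumptions made for illustration.

```python
def quantize_uniform(weights, num_bits):
    """Symmetric uniform quantization of a list of floats to signed integer codes.

    Illustrative baseline only; this is NOT the GNP-IQ scheme from the abstract.
    """
    if num_bits < 2:
        raise ValueError("num_bits must be at least 2")
    q_max = 2 ** (num_bits - 1) - 1              # largest code, e.g. 7 for 4 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / q_max if max_abs > 0 else 1.0
    # Round to the nearest code and clip to the representable range.
    codes = [max(-q_max, min(q_max, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]


def mean_squared_error(a, b):
    """Average squared difference between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Hypothetical example weights (not taken from any model).
weights = [0.82, -0.31, 0.05, -0.77, 0.44, 0.10]
for bits in (8, 4, 2):
    codes, scale = quantize_uniform(weights, bits)
    err = mean_squared_error(weights, dequantize(codes, scale))
    print(f"{bits}-bit codes={codes} mse={err:.6f}")
```

Running the loop prints the integer codes and the mean squared reconstruction error for each bit-width, which makes the memory-versus-error trade-off that mixed-precision schemes try to improve concrete.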

Clean images are crucial for visual tasks such as small object detection, especially at high resolutions. However, real-world images are often degraded by adverse weather, and weather restoration methods may sacrifice high-frequency details critical for analyzing small objects. A natural solution is to apply super-resolution (SR) after weather removal to recover both clarity and fine structures. However, simply cascading restoration and SR struggles to bridge their inherent conflict: removal aims to eliminate high-frequency weather-induced noise, while SR aims to hallucinate high-frequency textures from existing details, leading to inconsistent restored content. In this paper, we take deraining as a case study and propose DHGM, a Diffusion-based High-frequency Guided Model for generating clean and high-resolution images. DHGM integrates pre-trained diffusion priors with high-pass filters to simultaneously remove rain artifacts and enhance structural details. Extensive experiments demonstrate that DHGM achieves superior performance over existing methods, with lower costs.

Seeing Through the Rain: Resolving High-Frequency Conflicts in Deraining and Super-Resolution via Diffusion Guidance

Nucleus detection and classification (NDC) in histopathology analysis is a fundamental task that underpins a wide range of high-level pathology applications. However, existing methods heavily rely on labor-intensive nucleus-level annotations and struggle to fully exploit large-scale unlabeled data for learning discriminative nucleus representations. In this work, we propose MUSE (MUlti-scale denSE self-distillation), a novel self-supervised learning method tailored for NDC. At its core is NuLo (Nucleus-based Local self-distillation), a coordinate-guided mechanism that enables flexible local self-distillation based on predicted nucleus positions. By removing the need for strict spatial alignment between augmented views, NuLo allows critical cross-scale alignment, thus unlocking the capacity of models for fine-grained nucleus-level representation. To support MUSE, we design a simple yet effective encoder-decoder architecture and a large field-of-view semi-supervised fine-tuning strategy that together maximize the value of unlabeled pathology images. Extensive experiments on three widely used benchmarks demonstrate that MUSE effectively addresses the core challenges of histopathological NDC. The resulting models not only surpass state-of-the-art supervised baselines but also outperform generic pathology foundation models.

MUSE: Multi-Scale Dense Self-Distillation for Nucleus Detection and Classification

The slow sampling speed of diffusion models hinders their application in 3D LiDAR scene completion. To address this, we propose Distillation-DPO, a novel framework that accelerates sampling through score distillation while simultaneously enhancing generation quality via preference alignment. Distillation-DPO follows a three-step procedure. First, the student model generates paired completion scenes with different initial noises. Second, using LiDAR scene evaluation metrics as preference, we construct winning and losing sample pairs. Third, as our core innovation, Distillation-DPO optimizes the student model by exploiting the difference in score functions between the teacher and student models on the paired completion scenes. This operation performs variational score distillation of the student model but simultaneously encourages the distilled student to prefer the winning samples over the losing ones. Extensive experiments demonstrate that Distillation-DPO achieves higher-quality scene completion than state-of-the-art diffusion models, while accelerating sampling by over 5-fold. To our knowledge, our work is the first to integrate the preference learning principle of DPO into the distillation of diffusion models, offering a new paradigm of preference-aligned distillation.

Diffusion Distillation with Direct Preference Optimization for Efficient 3D LiDAR Scene Completion
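The Distillation-DPO abstract above builds on the preference learning principle of DPO: prefer the winning sample over the losing one, measured relative to a reference model. As a hedged illustration of that principle only, here is a minimal sketch of the standard DPO loss for a single preference pair, written in terms of log-likelihoods. The paper's method works with score functions of teacher and student diffusion models, which this sketch does not reproduce; the function name, argument names, and the default `beta` are assumptions.

```python
import math


def dpo_loss(policy_logp_win, policy_logp_lose,
             ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard DPO loss for one (winning, losing) pair.

    loss = -log sigmoid(beta * [(log pi(w) - log ref(w))
                                - (log pi(l) - log ref(l))])

    Illustrative only: Distillation-DPO in the abstract operates on
    diffusion score functions, not on these log-likelihoods.
    """
    margin = beta * ((policy_logp_win - ref_logp_win)
                     - (policy_logp_lose - ref_logp_lose))
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical log-probabilities: the policy favours the winner more
# strongly than the reference does, so the loss falls below log(2).
print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
print(dpo_loss(-3.0, -1.0, -2.0, -2.0))  # policy favours the loser: higher loss
```

When the policy and the reference agree, the margin is zero and the loss equals log 2; the loss decreases as the policy shifts probability toward the winning sample relative to the reference.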

Multivariate time series forecasting is crucial across a wide range of domains. While presenting notable progress for the Transformer architecture, iTransformer still lags behind the latest MLP-based models. We attribute this performance gap to unstable inter-channel relationships. To bridge this gap, we propose EMAformer, a simple yet effective model that enhances the Transformer with an auxiliary embedding suite, akin to armor that reinforces its ability. By introducing three key inductive biases, i.e., *global stability*, *phase sensitivity*, and *cross-axis specificity*, EMAformer unlocks the further potential of the Transformer architecture, achieving state-of-the-art performance on 12 real-world benchmarks and reducing forecasting errors by an average of 2.73% in MSE and 5.15% in MAE. This significantly advances the practical applicability of Transformer-based approaches for multivariate time series forecasting. The code is available at https://github.com/PlanckChang/EMAformer.

EMAformer: Enhancing Transformer Through Embedding Armor for Time Series Forecasting

Attribute-specific fashion retrieval aims to enhance fine-grained image retrieval by emphasizing the similarity of specific attributes. Current methods primarily rely on attention mechanisms to extract attribute-related visual features but face two key challenges: the limitations of coarse-grained localization in achieving fine-grained accuracy, and an imbalance between global and local perception, where excessive focus on local features can undermine overall performance. To address these issues, we propose the fashion microscope **ProFashion**, which achieves pixel-level attribute awareness through optimal transport and neural semantic aggregation. The framework begins by employing optimal transport to align semantic attributes with visual patterns from a global perspective, generating an attribute-visual value map that highlights distinctive regions while reducing interference. This is followed by simulating the human brain's perception of attribute feature patterns through superpixel generation and aggregation, capturing attribute-related features at the pixel semantic level and forming key semantic clusters that preserve microstructures. Building on this, an attribute graph is constructed to facilitate feature clustering, significantly enhancing the framework's capability to handle overlapping features and cross-scale relationships. Comprehensive experiments on the *FashionAI*, *DeepFashion*, and *DARN* datasets demonstrate the framework's effectiveness, achieving overall MAP improvements of **3.11%**, **3.70%**, and **3.49%**, respectively. Additionally, the framework delivers relative average throughput gains of **26.94%**, **22.22%**, and **24.78%** on the *FashionAI*, *DeepFashion*, and *DARN* datasets, respectively.

Fashion Microscope: Pixel-Level Attribute Perception via Optimal Transport and Neural Semantic Aggregation

While vision-language models (VLMs) have demonstrated remarkable performance across various tasks combining textual and visual information, they continue to struggle with fine-grained visual perception tasks that require detailed pixel-level analysis. Effectively eliciting comprehensive reasoning from VLMs on such intricate visual elements remains an open challenge. In this paper, we present VipAct, an agent framework that enhances VLMs by integrating multi-agent collaboration and vision expert models, enabling more precise visual understanding and comprehensive reasoning. VipAct consists of an orchestrator agent, which manages task requirement analysis, planning, and coordination, along with specialized agents that handle specific tasks such as image captioning and vision expert models that provide high-precision perceptual information. This multi-agent approach allows VLMs to better perform fine-grained visual perception tasks by synergizing planning, reasoning, and tool use. We evaluate VipAct on benchmarks featuring a diverse set of visual perception tasks, with experimental results demonstrating significant performance improvements over state-of-the-art baselines across all tasks. Furthermore, comprehensive ablation studies reveal the critical role of multi-agent collaboration in eliciting more detailed System-2 reasoning and highlight the importance of image input for task planning. Additionally, our error analysis identifies patterns of VLMs' inherent limitations in visual perception, providing insights into potential future improvements. VipAct offers a flexible and extensible framework, paving the way for more advanced visual perception systems across various real-world applications.

VipAct: Visual-Perception Enhancement via Specialized VLM Agent Collaboration and Tool-use

Spiking Neural Networks (SNNs) offer promising energy efficiency and temporal sparsity for edge intelligence, but their training remains difficult due to gradient mismatch, membrane potential drift, and discretization errors. In this paper, we propose a membrane potential guided surrogate optimization framework that dynamically aligns the surrogate function with the membrane potential distribution to enhance gradient propagation. Specifically, we introduce a KL-divergence-based regularization to stabilize membrane potential dynamics, and an adaptive width constraint to synchronize the surrogate gradient range with neural activity statistics. Additionally, we design a spike discretization error metric and a correction strategy to mitigate temporal discretization effects. Experiments on CIFAR-10, CIFAR-100, and ImageNet show our method achieves 94.15%, 72.20%, and 65.70% top-1 accuracy, respectively, while improving gradient stability and energy efficiency. This work provides a principled optimization scheme for robust and scalable SNN training in practical neuromorphic systems.

Optimization Method for Surrogate Function in Spiking Neural Networks Based on Membrane Potential Distribution
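The abstract above concerns surrogate gradients: the spike function is a hard threshold with zero derivative almost everywhere, so training substitutes a smooth or windowed stand-in whose width determines which neurons receive gradient. The following is a minimal sketch of a rectangular surrogate. The width heuristic (`width_from_statistics`, a multiple of the standard deviation of the membrane potentials) is a hypothetical illustration of tying width to activity statistics, not the adaptive width constraint or KL regularizer proposed in the paper.

```python
import statistics


def spike(membrane_potential, threshold=1.0):
    """Forward pass: emit a spike (1.0) when the potential reaches the threshold."""
    return 1.0 if membrane_potential >= threshold else 0.0


def rectangular_surrogate_grad(membrane_potential, threshold=1.0, width=1.0):
    """Backward-pass stand-in for d(spike)/d(potential).

    Returns 1/width inside a window of size `width` centred on the
    threshold and 0 outside, so the surrogate integrates to 1.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    inside = abs(membrane_potential - threshold) < width / 2
    return 1.0 / width if inside else 0.0


def width_from_statistics(potentials, k=2.0):
    """Hypothetical heuristic: window width = k * population std of potentials.

    An assumption for illustration; NOT the paper's adaptive width constraint.
    """
    return k * statistics.pstdev(potentials)


# Hypothetical membrane potentials for five neurons at one time step.
potentials = [0.2, 0.9, 1.1, 1.6, 2.4]
width = width_from_statistics(potentials)
spikes = [spike(v) for v in potentials]
grads = [rectangular_surrogate_grad(v, width=width) for v in potentials]
print(spikes)
print(grads)
```

A window that is too narrow leaves most neurons with zero gradient, while one that is too wide treats neurons far from the threshold as if they were about to fire; this is the mismatch that motivates matching the surrogate to the membrane potential distribution.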

Large language models sometimes inadvertently reproduce passages that are copyrighted, exposing downstream applications to legal risk. Most existing studies for inference-time defences focus on surface-level token matching and rely on external blocklists or filters, which add deployment complexity and may overlook semantically paraphrased leakage. In this work, we reframe copyright infringement mitigation as intrinsic semantic-space control and introduce SCOPE, an inference-time method that requires no parameter updates or auxiliary filters. Specifically, the sparse autoencoder (SAE) projects hidden states into a high-dimensional, near-monosemantic space; benefiting from this representation, we identify a copyright-sensitive subspace and clamp its activations during decoding. Experiments on widely recognized benchmarks show that SCOPE mitigates copyright infringement without degrading general utility. Further interpretability analyses confirm that the isolated subspace captures high-level semantics.

SCOPE: Intrinsic Semantic Space Control for Mitigating Copyright Infringement in LLMs
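The SCOPE abstract above describes projecting hidden states into a sparse autoencoder (SAE) feature space and clamping a sensitive subset of activations during decoding. The following is a minimal sketch of that encode, clamp, decode pattern in plain Python. The weights, the choice of sensitive feature index, and all function names are hypothetical; how SCOPE identifies the copyright-sensitive subspace and what value it clamps to are not specified in the abstract.

```python
def relu(x):
    return x if x > 0 else 0.0


def sae_encode(hidden, enc_weights, enc_bias):
    """Project a hidden state into (non-negative) SAE feature activations."""
    return [relu(sum(w * h for w, h in zip(row, hidden)) + b)
            for row, b in zip(enc_weights, enc_bias)]


def clamp_features(features, sensitive_indices, ceiling=0.0):
    """Cap the activations of the chosen features at `ceiling`.

    `sensitive_indices` is assumed to be given; finding it is the
    part of the method this sketch does not cover.
    """
    return [min(f, ceiling) if i in sensitive_indices else f
            for i, f in enumerate(features)]


def sae_decode(features, dec_weights, dec_bias):
    """Map feature activations back to hidden space.

    dec_weights[i] is the hidden-space direction written by feature i.
    """
    out = list(dec_bias)
    for f, direction in zip(features, dec_weights):
        for d in range(len(out)):
            out[d] += f * direction[d]
    return out


# Hypothetical 2-dimensional hidden state and 3-feature SAE.
hidden = [1.0, 2.0]
enc_weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
enc_bias = [0.0, 0.0, 0.0]
dec_weights = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
dec_bias = [0.0, 0.0]

features = sae_encode(hidden, enc_weights, enc_bias)
steered = clamp_features(features, sensitive_indices={2})
print(sae_decode(features, dec_weights, dec_bias))
print(sae_decode(steered, dec_weights, dec_bias))
```

Because only the selected features are altered, the contribution of every other feature to the reconstructed hidden state is untouched, which is the intuition behind intervening in a near-monosemantic space rather than on raw hidden units.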

Large Vision-Language Models (LVLMs) enhance performance on vision-language tasks by integrating visual features from pre-trained vision encoders into large language models (LLMs). However, the large number of visual tokens introduces significant computational overhead. Existing token pruning methods either perform global selection via [CLS]-based attention in the vision encoder or prune within LLM decoding layers. These approaches face two key challenges: (1) [CLS]-based attention primarily focuses on visually salient regions across the entire image, often overlooking semantically important tokens essential for reasoning; and (2) strong positional bias in the shallow decoder layers causes the model to favor later-positioned tokens, while neglecting earlier ones that may carry critical reasoning cues. To address these issues, we propose PosPrune, a training-free, two-stage visual token pruning framework. At the vision encoder, we introduce an Asymmetric Region-aware Pruning (ARP) strategy that retains more tokens in semantically rich regions while discarding more tokens from semantically less informative regions, thus preserving spatial diversity and task-relevant details. In the LLM decoding stage, we find that the positional bias in shallow layers is primarily driven by model architecture rather than task semantics. Based on this insight, we propose a novel Positional Bias Correction (PBC) mechanism to mitigate this bias. To further reduce redundancy, we apply Maximal Marginal Relevance (MMR) to select tokens that best balance textual relevance and diversity. Extensive experiments on various LVLMs and benchmarks demonstrate the general effectiveness of our approach. Notably, when applied to LLaVA-1.5-7B, PosPrune achieves a reduction of 85% in FLOPs while preserving 98.5% of the original performance.
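The PosPrune abstract above uses Maximal Marginal Relevance (MMR) to pick tokens that balance relevance against redundancy. The following is a minimal sketch of generic greedy MMR over embedding vectors with cosine similarity. The vectors, the trade-off weight `lam`, and the function names are assumptions for illustration; the abstract does not state the exact similarity measure or weighting that PosPrune uses.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mmr_select(query, candidates, k, lam=0.5):
    """Greedy MMR: return indices of up to k candidates.

    Each step picks the candidate maximising
        lam * sim(candidate, query) - (1 - lam) * max sim(candidate, selected).
    """
    relevance = [cosine(query, c) for c in candidates]
    selected = []
    remaining = list(range(len(candidates)))

    def score(i):
        redundancy = max(
            (cosine(candidates[i], candidates[j]) for j in selected),
            default=0.0,
        )
        return lam * relevance[i] - (1 - lam) * redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical token embeddings: indices 0 and 1 are near-duplicates.
query = [1.0, 1.0, 0.0]
tokens = [
    [1.0, 0.02, 0.0],
    [0.99, 0.05, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
print(mmr_select(query, tokens, k=2, lam=0.5))  # diversity-aware selection
print(mmr_select(query, tokens, k=2, lam=1.0))  # relevance only
```

With `lam=1.0` the selection reduces to plain top-k by relevance and keeps both near-duplicate tokens, whereas a smaller `lam` trades some relevance for coverage of a different direction in the embedding space.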
