Text-video retrieval, a crucial task in multi-modal intelligence, aims to bridge vision and language by learning video and textual features whose similarity quantifies semantic relevance. A common limitation of current approaches is the oversimplification of video content: complex spatiotemporal structure is compressed into a single global representation. Consequently, these methods struggle to capture the dynamic visual variations and discriminative appearances within a video, which further complicates cross-modal alignment. To alleviate these issues, we introduce a novel decoupling approach that processes appearance and motion cues independently, capitalizing on their complementary nature for more expressive video modeling. Specifically, we propose an appearance-motion decomposed network (AMD-Net) that decouples spatial-level appearance understanding from temporal-level motion understanding via a discriminative appearance learning module and a multi-scale motion learning module. The proposed model enjoys several merits. First, the discriminative appearance learning module uses a Singular Value Decomposition (SVD) based prototype initialization to effectively reduce redundant appearance information, and its high-order cross-aggregation mechanism enhances prototype resilience and facilitates comprehensive video understanding. Second, the multi-scale motion learning (MML) module captures motion features at varying temporal scales, which complement appearance features for accurate text-video retrieval. Extensive experiments on five standard benchmarks demonstrate that our method performs favorably against state-of-the-art methods.
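To make the SVD-based prototype initialization concrete, below is a minimal sketch of one plausible reading of the abstract: appearance prototypes are seeded from the dominant singular directions of a video's frame features, so that redundant (near-duplicate) frames collapse onto a few orthogonal prototypes. The tensor shapes, the centering step, the number of prototypes, and the function name `init_appearance_prototypes` are illustrative assumptions, not the authors' exact formulation.

```python
# A sketch of SVD-based appearance-prototype initialization (assumed design,
# not the paper's exact method).
import torch

def init_appearance_prototypes(frame_feats: torch.Tensor, num_prototypes: int) -> torch.Tensor:
    """frame_feats: (T, D) per-frame appearance features for one video.
    Returns (num_prototypes, D) prototypes spanning the dominant, least
    redundant appearance directions."""
    # Center the features so the SVD captures variation rather than the mean frame.
    centered = frame_feats - frame_feats.mean(dim=0, keepdim=True)
    # Truncated SVD: U (T, k), S (k,), Vh (k, D) with k = min(T, D).
    U, S, Vh = torch.linalg.svd(centered, full_matrices=False)
    # Top right-singular vectors give orthogonal appearance directions;
    # scaling by singular values preserves their relative importance.
    k = min(num_prototypes, Vh.shape[0])
    return S[:k, None] * Vh[:k]

# Usage: 12 sampled frames with 512-d features, reduced to 4 prototypes.
protos = init_appearance_prototypes(torch.randn(12, 512), num_prototypes=4)
print(protos.shape)  # torch.Size([4, 512])
```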
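Similarly, one simple way to realize multi-scale motion learning is to run temporal 1-D convolutions with different kernel sizes over the frame sequence, so each branch responds to motion over a different temporal extent. The kernel sizes, the fusion by concatenation, and the class name `MultiScaleMotion` are assumptions made for illustration; the paper's MML module may differ in detail.

```python
# A sketch of multi-scale temporal motion modeling in the spirit of MML
# (assumed design, not the paper's exact module).
import torch
import torch.nn as nn

class MultiScaleMotion(nn.Module):
    def __init__(self, dim: int, scales=(3, 5, 7)):
        super().__init__()
        # One temporal-convolution branch per scale; padding keeps length T.
        self.branches = nn.ModuleList(
            nn.Conv1d(dim, dim, kernel_size=k, padding=k // 2) for k in scales
        )
        self.fuse = nn.Linear(dim * len(scales), dim)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (B, T, D); Conv1d expects (B, D, T).
        x = frame_feats.transpose(1, 2)
        # Each branch captures motion over a different temporal window.
        multi = torch.cat([b(x) for b in self.branches], dim=1)  # (B, D*S, T)
        return self.fuse(multi.transpose(1, 2))  # back to (B, T, D)

# Usage: batch of 2 videos, 12 frames each, 512-d frame features.
motion = MultiScaleMotion(512)(torch.randn(2, 12, 512))
print(motion.shape)  # torch.Size([2, 12, 512])
```

The multi-branch output would then complement the appearance prototypes, consistent with the abstract's claim that motion and appearance cues are complementary for cross-modal alignment.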