Attribute-specific fashion retrieval aims to enhance fine-grained image retrieval by emphasizing the similarity of specific attributes. Current methods rely primarily on attention mechanisms to extract attribute-related visual features, but they face two key challenges: coarse-grained localization limits fine-grained accuracy, and an imbalance between global and local perception means that excessive focus on local features can undermine overall performance. To address these issues, we propose the fashion microscope ProFashion, which achieves pixel-level attribute awareness through optimal transport and neural semantic aggregation. The framework first employs optimal transport to align semantic attributes with visual patterns from a global perspective, generating an attribute-visual value map that highlights distinctive regions while suppressing interference. It then mimics the human brain's perception of attribute feature patterns through superpixel generation and aggregation, capturing attribute-related features at the pixel-semantic level and forming key semantic clusters that preserve microstructures. Building on this, an attribute graph is constructed to guide feature clustering, significantly enhancing the framework's ability to handle overlapping features and cross-scale relationships. Comprehensive experiments on the FashionAI, DeepFashion, and DARN datasets demonstrate the framework's effectiveness, with overall MAP improvements of 3.11%, 3.70%, and 3.49%, respectively. The framework also delivers relative average throughput gains of 26.94%, 22.22%, and 24.78% on the same datasets.
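The optimal-transport alignment step can be illustrated with a minimal sketch: entropic-regularized optimal transport (Sinkhorn iterations) matches a set of attribute embeddings against patch-level visual features, and the resulting transport plan is normalized into a per-attribute relevance map over patches. This is a generic illustration under assumed shapes and random data, not the paper's actual implementation.

```python
import numpy as np

def sinkhorn(cost, eps=0.1, n_iters=200):
    """Entropic-regularized OT between uniform marginals (Sinkhorn iterations).

    cost: (n, m) cost matrix; returns an (n, m) transport plan P whose rows
    and columns approximately sum to the uniform marginals 1/n and 1/m.
    """
    n, m = cost.shape
    K = np.exp(-cost / eps)              # Gibbs kernel
    r, c = np.ones(n) / n, np.ones(m) / m
    u = np.ones(n) / n
    for _ in range(n_iters):             # alternating marginal scaling
        v = c / (K.T @ u)
        u = r / (K @ v)
    return u[:, None] * K * v[None, :]

# Hypothetical example: 4 attribute embeddings vs. 16 image-patch features.
rng = np.random.default_rng(0)
attrs = rng.normal(size=(4, 8))          # attribute semantic vectors
patches = rng.normal(size=(16, 8))       # visual patch features

# Cost = 1 - cosine similarity between attributes and patches.
a = attrs / np.linalg.norm(attrs, axis=1, keepdims=True)
p = patches / np.linalg.norm(patches, axis=1, keepdims=True)
cost = 1.0 - a @ p.T

P = sinkhorn(cost)
# Row-normalize the plan into a per-attribute value map over patches.
value_map = P / P.sum(axis=1, keepdims=True)
print(value_map.shape)                   # (4, 16)
```

Each row of `value_map` is a distribution over patches indicating how strongly each region responds to that attribute; in the framework this global alignment suppresses interference before the finer superpixel-level aggregation.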
