Lecture image placeholder

Premium content

Access to this content requires a subscription. You must be a premium user to view this content.

Monthly subscription - $9.99Pay per view - $4.99Access through your institutionLogin with Underline account
Need help?
Contact us
Lecture placeholder background

AAAI 2025

February 28, 2025

Philadelphia, United States

Would you like to see your presentation here, made available to a global audience of researchers?
Add your own presentation or have us affordably record your next conference.

keywords:

deep neural architectures and foundation models

ml

Due to the capability of dynamic state space models (SSMs) in capturing long-range dependencies with near-linear computational complexity, Mamba has demonstrated outstanding performance in NLP tasks. This has inspired the rapid development of Mamba-based vision models, which have achieved promising results in visual recognition tasks. However, such models are not capable of distilling features across layers through feature aggregation, interaction, and selection. On the other hand, existing cross-layer feature aggregation methods designed for CNNs or ViTs are not practical in Mamba-based models due to high computational or memory costs. Therefore, this paper aims to introduce an efficient cross-layer feature aggregation mechanism for Mamba-based vision backbone networks. Inspired by the Retinal Ganglion Cells (RGCs) in the human visual system, we propose a new sparse cross-layer connection mechanism termed SparX to effectively improve cross-layer feature interaction and reuse. Specifically, we build two different types of network layers based on SSM, namely ganglion layers and normal layers. The former has higher connectivity and complexity, enabling multi-layer feature aggregation and interaction in an input-dependent manner. In contrast, the latter has lower connectivity and complexity. By interleaving these two types of layers, we design a new Mamba-based network with sparsely cross-connected layers, which achieves an excellent trade-off among model size, computational cost, memory cost, and accuracy in comparison to existing Mamba-based vision models. For instance, with fewer parameters, SparX-Mamba-T improves the top-1 accuracy of VMamba-T from 82.5\% to 83.5\%. Extensive experimental results demonstrate that our new network architecture possesses both superior performance and generalization capabilities on various vision tasks.

Next from AAAI 2025

MimiQ: Low-Bit Data-Free Quantization of Vision Transformers with Encouraging Inter-Head Attention Similarity
poster

MimiQ: Low-Bit Data-Free Quantization of Vision Transformers with Encouraging Inter-Head Attention Similarity

AAAI 2025

+5Jonghyun Choi
Kanghyun Choi and 7 other authors

28 February 2025

Stay up to date with the latest Underline news!

Select topic of interest (you can select more than one)

PRESENTATIONS

  • All Presentations
  • For Librarians
  • Resource Center
  • Free Trial
Underline Science, Inc.
1216 Broadway, 2nd Floor, New York, NY 10001, USA

© 2026 Underline - All rights reserved