Vision modeling has advanced rapidly with Transformers, whose attention mechanisms capture visual dependencies but lack a principled account of how semantic information propagates spatially. To address this, many physics-inspired models adopt heat conduction dynamics, where each spatial-frequency component decays exponentially with an exponent proportional to the product of time and the squared frequency, inherently coupling the two and causing high-frequency components to decay far faster than low-frequency ones. However, this preferential decay of high-frequency signals suppresses textures, edges, and other fine details that are crucial for preserving semantic richness in vision models. In this paper, we introduce WaveFormer, a novel physics-inspired vision backbone built on frequency–time decoupled wave propagation. By decoupling frequency from temporal evolution through an underdamped wave equation, high-frequency components oscillate rather than being rapidly damped, preserving fine-grained details while maintaining low-frequency stability. For efficient and interpretable modeling, we derive a closed-form solution of the underdamped wave equation in which the temporal decay envelope is independent of spatial frequency. Building on this solution, we implement the Frequency–Time Decoupled Wave Propagation Operator (WPO), a lightweight module that models global interactions in $\mathcal{O}(N \log N)$ time, far below the $\mathcal{O}(N^2)$ cost of attention. We propose a family of WaveFormer models as drop-in replacements for standard ViTs and CNNs, achieving competitive accuracy across image classification, object detection, and semantic segmentation, while delivering up to $1.6\times$ higher throughput and 30\% fewer FLOPs than attention-based alternatives. Furthermore, our results demonstrate that wave propagation introduces a modeling bias complementary to heat-based approaches, capturing both global coherence and the high-frequency details essential for rich visual semantics.
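To make the contrast concrete, the following is a minimal numerical sketch, not the paper's implementation: under heat dynamics each Fourier mode decays as $e^{-\alpha k^2 t}$, whereas the closed-form underdamped wave solution oscillates inside an envelope $e^{-\gamma t}$ that is independent of the spatial frequency $k$; applying such a per-frequency response via the FFT yields a global mixing step in $\mathcal{O}(N \log N)$. The parameters `gamma` (damping), `c` (wave speed), and all function names below are illustrative assumptions, not the exact WPO parameterization.

```python
import numpy as np

def heat_response(k, t, alpha=1.0):
    """Heat equation u_t = alpha * u_xx: Fourier mode k decays as
    exp(-alpha * k^2 * t), so decay is coupled to spatial frequency."""
    return np.exp(-alpha * k**2 * t)

def wave_response(k, t, gamma=0.5, c=1.0):
    """Underdamped wave u_tt + 2*gamma*u_t = c^2 * u_xx with u(0)=1,
    u'(0)=0: the mode oscillates at omega = sqrt(c^2 k^2 - gamma^2)
    inside an envelope exp(-gamma * t) independent of k."""
    omega = np.sqrt(np.maximum(c**2 * k**2 - gamma**2, 0.0))
    eps = 1e-12  # guards k = 0, where omega = 0 and sin(omega * t) = 0
    return np.exp(-gamma * t) * (np.cos(omega * t)
                                 + gamma / (omega + eps) * np.sin(omega * t))

def wave_mix(x, t=1.0, gamma=0.5, c=1.0):
    """Toy global mixing operator: apply the per-frequency wave response
    in the Fourier domain, so the whole step costs O(N log N)."""
    n = x.shape[0]                                # x: (N, d) token features
    k = 2.0 * np.pi * np.fft.rfftfreq(n)          # spatial frequencies
    h = wave_response(k, t, gamma, c)             # per-frequency response
    return np.fft.irfft(h[:, None] * np.fft.rfft(x, axis=0), n=n, axis=0)

k = np.array([1.0, 4.0, 16.0])        # low, mid, high spatial frequencies
print(heat_response(k, 1.0))          # the high-frequency mode is crushed
print(np.abs(wave_response(k, 1.0)))  # all modes keep a comparable magnitude
```

Multiplying by `h` in the frequency domain is a circular convolution over tokens, so every token interacts with every other token without forming an $N \times N$ attention matrix; only the shape of the response function, not the asymptotic cost, would change in a learned variant.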
