We propose a novel framework, Continuous-Time Attention, which infuses partial differential equations (PDEs) into the Transformer's attention mechanism to address the challenges of extremely long input sequences. Instead of relying solely on a static attention matrix, we allow attention weights to evolve over a pseudo-time dimension via diffusion, wave, or reaction-diffusion dynamics. This mechanism systematically smooths local noise, enhances long-range dependencies, and stabilizes gradient flow. Theoretically, our analysis shows that PDE-based attention leads to better optimization landscapes and polynomial rather than exponential decay of distant interactions. Empirically, we evaluate our method in diverse experiments and demonstrate consistent gains over both standard and specialized long-sequence Transformer variants. Our findings highlight the potential of PDE-based formulations to enrich attention mechanisms with continuous-time dynamics and global coherence.
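To make the diffusion variant concrete, here is a minimal sketch (not the authors' implementation) of attention scores evolved by explicit-Euler diffusion steps along the key axis before the softmax; the function name `pde_attention` and the parameters `num_steps` and `dt` are illustrative assumptions.

```python
# Minimal sketch, assuming a standard scaled dot-product attention backbone.
# Attention scores are smoothed over a pseudo-time dimension by an explicit
# Euler discretization of the 1D diffusion (heat) equation along the key axis.
import torch
import torch.nn.functional as F

def pde_attention(q, k, v, num_steps=4, dt=0.1):
    """q, k, v: (batch, heads, seq_len, dim). Returns attended values."""
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5  # (B, H, L, L)

    # Discrete 1D Laplacian over the key dimension: s[j-1] - 2*s[j] + s[j+1],
    # with replicate padding at the sequence boundaries.
    for _ in range(num_steps):
        left = torch.cat([scores[..., :1], scores[..., :-1]], dim=-1)
        right = torch.cat([scores[..., 1:], scores[..., -1:]], dim=-1)
        laplacian = left - 2.0 * scores + right
        scores = scores + dt * laplacian  # explicit Euler diffusion update

    attn = F.softmax(scores, dim=-1)
    return attn @ v

# Tiny usage example
q = k = v = torch.randn(2, 4, 128, 32)
out = pde_attention(q, k, v)
print(out.shape)  # torch.Size([2, 4, 128, 32])
```

Each pseudo-time step spreads attention mass between neighboring keys, which is one way the described mechanism could smooth local noise while preserving the overall attention structure; the wave and reaction-diffusion variants would replace the update rule inside the loop.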