Rectified flow models have shown strong potential for high-fidelity video generation, yet extending them to high resolution remains challenging because full attention is expensive and errors accumulate during ODE solving. In this paper, we propose S$^2$Flow, a training-free framework that enables efficient and authentic high-resolution video generation by jointly exploring \textbf{Flow}-guided \textbf{S}parse attention and a \textbf{S}econd-order ODE solution. Specifically, S$^2$Flow transfers semantic and structural information from the low-resolution flow trajectory to guide the high-resolution flow in two ways. First, S$^2$Flow dynamically captures the sparse patterns of the spatio-temporal attention maps of low-resolution videos and uses them to construct localized 3D windows, enabling efficient window attention in high-resolution inference. This significantly reduces redundant computation while preserving contextual dependencies. Second, S$^2$Flow adopts a second-order ODE solver based on Taylor expansion, in which the higher-order derivative is approximated by a central difference over the low-resolution flow, facilitating accurate high-resolution denoising. Extensive experiments on the VBench dataset demonstrate that S$^2$Flow outperforms prior methods in both visual quality and inference speed, achieving a $4\times$ acceleration on $2560 \times 1536$ video generation.
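
To make the second-order idea concrete, here is a minimal sketch of a Taylor-expansion step, $x_{t+h} \approx x_t + h\,v_t + \tfrac{h^2}{2}\,\dot{v}_t$, in which $\dot{v}_t$ is estimated by a central difference over velocities recorded along a separate reference trajectory. It is not the authors' implementation. It assumes a toy scalar ODE $dx/dt = -x$ in place of a video model, and the analytic velocities along the exact solution stand in for the low-resolution flow. All function names (`velocity`, `reference_derivative`, `euler_solve`, `taylor2_solve`) are hypothetical.

```python
import math

def velocity(x, t):
    # Toy velocity field (assumption): dx/dt = -x, exact solution x(t) = exp(-t).
    return -x

def reference_derivative(ref_v, i, h):
    # Time derivative of the reference velocities at grid index i.
    # Central difference where both neighbours exist, forward difference at i = 0.
    if i == 0:
        return (ref_v[1] - ref_v[0]) / h
    return (ref_v[i + 1] - ref_v[i - 1]) / (2.0 * h)

def euler_solve(x0, n_steps, h):
    # First-order baseline: x <- x + h * v.
    x = x0
    for i in range(n_steps):
        x = x + h * velocity(x, i * h)
    return x

def taylor2_solve(x0, n_steps, h, ref_v):
    # Second-order Taylor step: x <- x + h * v + (h^2 / 2) * dv/dt,
    # with dv/dt taken from the reference trajectory instead of the model itself.
    x = x0
    for i in range(n_steps):
        v = velocity(x, i * h)
        dv_dt = reference_derivative(ref_v, i, h)
        x = x + h * v + 0.5 * h * h * dv_dt
    return x

n_steps, h = 10, 0.1
# Stand-in for the low-resolution flow (assumption): velocities along the exact solution.
ref_v = [-math.exp(-i * h) for i in range(n_steps + 1)]

exact = math.exp(-n_steps * h)
err_euler = abs(euler_solve(1.0, n_steps, h) - exact)
err_taylor2 = abs(taylor2_solve(1.0, n_steps, h, ref_v) - exact)
print(f"Euler error:   {err_euler:.5f}")
print(f"Taylor-2 error: {err_taylor2:.5f}")
```

In this toy setting the second-order update ends closer to the exact solution than the Euler update at the same step count, without any extra evaluation of `velocity`, because the derivative term comes entirely from the precomputed reference trajectory.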
