Recent advances in deep learning-based 3D representation have achieved remarkable success, particularly in modeling static high-fidelity geometries. Extending these techniques to dynamic 3D scenes, however, introduces the critical challenge of representing spatio-temporal dependencies, i.e., jointly modeling detailed spatial structures within frames and temporal dynamics across frames. To address this challenge, this paper proposes that the temporal evolution observed in dynamic 3D scenes is fundamentally attributable to the deformation of underlying spatial structures. To capture this relationship, we introduce SEP-4D, a unified continuous 4D latent representation that incorporates a structure-equivalence prior. At the core of SEP-4D is an efficient 4D tensor decomposition-fusion approach: decomposed learnable 2D feature planes are fused through a plane-wise spatio-temporal fusion mechanism over planar distributions, explicitly enforcing the principle that temporal evolution originates from geometric deformation of the 3D structure. To mitigate the associated computational cost, we sample the 3D probability volumes produced by VAE-based fusion into a spatio-temporally consistent 4D latent representation. We validate our approach on the fundamental task of 4D occupancy reconstruction. Extensive results demonstrate that, by leveraging the inherent equivalence of temporal dynamics and structural deformation, our method achieves high-quality reconstruction across various sequence lengths; notably, on 4-frame scenes it attains 91.68% mIoU, significantly outperforming state-of-the-art baselines on standard benchmarks. The code will be made publicly available.
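The abstract does not specify the exact form of the tensor decomposition, but the "decomposed learnable 2D feature planes" fused per plane for a 4D query are reminiscent of HexPlane/K-Planes-style factorizations. The sketch below is a generic illustration of that idea, not the authors' implementation: a 4D point (x, y, z, t) is projected onto six axis-aligned 2D planes, each plane is bilinearly sampled, and the per-plane features are fused by an elementwise product. All names, plane choices, and the product-fusion rule are illustrative assumptions.

```python
import numpy as np

def bilinear_sample(plane, u, v):
    """Bilinearly interpolate a (H, W, C) feature plane at coords (u, v) in [0, 1]."""
    H, W, _ = plane.shape
    x, y = u * (W - 1), v * (H - 1)
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)
    wx, wy = x - x0, y - y0
    return ((1 - wx) * (1 - wy) * plane[y0, x0]
            + wx * (1 - wy) * plane[y0, x1]
            + (1 - wx) * wy * plane[y1, x0]
            + wx * wy * plane[y1, x1])

def query_4d(planes, x, y, z, t):
    """Fuse six 2D plane features for one 4D query point.

    Three spatial planes (xy, xz, yz) capture within-frame structure;
    three space-time planes (xt, yt, zt) capture its temporal deformation.
    Fusion here is an elementwise product (an illustrative choice).
    """
    coords = {"xy": (x, y), "xz": (x, z), "yz": (y, z),
              "xt": (x, t), "yt": (y, t), "zt": (z, t)}
    feat = np.ones(planes["xy"].shape[-1])
    for name, (u, v) in coords.items():
        feat *= bilinear_sample(planes[name], u, v)
    return feat

# Toy example: six 16x16 learnable planes with 8 feature channels.
rng = np.random.default_rng(0)
planes = {k: rng.standard_normal((16, 16, 8))
          for k in ("xy", "xz", "yz", "xt", "yt", "zt")}
feat = query_4d(planes, 0.3, 0.7, 0.5, 0.1)
print(feat.shape)  # (8,) — one fused feature vector per 4D query
```

In a full model the fused feature would be decoded (e.g., by an MLP or, as the abstract suggests here, a VAE-style fusion producing probability volumes) into an occupancy prediction; this sketch only shows the decomposition-and-fusion step.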