The Dynamic Vision Sensor (DVS) asynchronously records sparse events triggered by changes in pixel intensity, offering high temporal resolution and low latency. Existing frame-based methods process event data densely, violating its inherent sparsity and introducing computational redundancy. Asynchronous models preserve the event stream's native format but often neglect spatial information, compromising their adaptability and efficiency. To address these limitations, we propose a Spatiotemporally Separated Sparse Network (S3Net) for efficient event stream encoding and learning. Specifically, we employ a learnable sparse encoding scheme to construct a voxel-structured representation that effectively captures the spatiotemporal relationships among events. We then introduce a dual-branch architecture that models localized spatial dependencies and dynamic temporal patterns separately. By explicitly decoupling spatial and temporal modeling, S3Net enables end-to-end asynchronous processing of variable-length event sequences, achieving both strong representational capacity and high computational efficiency. Extensive experiments on six event-based datasets show that S3Net establishes new state-of-the-art performance: it reduces computational cost by 35% and model parameters by 27% compared to frame-based approaches, while delivering 1.58× faster inference than existing point-based asynchronous methods at comparable accuracy.
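To make the two ideas in the abstract concrete, the sketch below illustrates (a) a learnable sparse encoding that scatters raw events into a voxel-structured grid and (b) a dual-branch module with separate spatial and temporal paths. This is a minimal illustrative sketch, not the authors' implementation: the event format (x, y, t, p), the per-event MLP weighting, the specific 2D/1D convolutional branches, and names such as `EventVoxelEncoder` and `DualBranch` are assumptions made here for clarity; S3Net's actual encoding and branch designs differ in detail.

```python
# Hedged sketch only -- hypothetical layers and names, not the S3Net code.
import torch
import torch.nn as nn

class EventVoxelEncoder(nn.Module):
    """Learnable sparse encoding (assumed form): scatter events into a
    (T, H, W) voxel grid, with a small MLP weighting each event."""
    def __init__(self, bins=8, height=128, width=128):
        super().__init__()
        self.bins, self.height, self.width = bins, height, width
        self.weight_mlp = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 1))

    def forward(self, events):                            # events: (N, 4) = x, y, t, p
        x, y, t, _ = events.unbind(dim=1)
        t = (t - t.min()) / (t.max() - t.min() + 1e-9)    # normalize timestamps to [0, 1]
        b = (t * (self.bins - 1)).long()                  # temporal bin index per event
        w = self.weight_mlp(events).squeeze(1)            # learned per-event contribution
        voxels = torch.zeros(self.bins, self.height, self.width, device=events.device)
        idx = b * self.height * self.width + y.long() * self.width + x.long()
        voxels.view(-1).scatter_add_(0, idx, w)           # sparse accumulation into the grid
        return voxels.unsqueeze(0)                        # (1, T, H, W)

class DualBranch(nn.Module):
    """Decoupled branches (assumed form): 2D convs capture local spatial
    structure; 1D convs over the bin axis capture temporal dynamics."""
    def __init__(self, bins=8, channels=32, num_classes=10):
        super().__init__()
        self.spatial = nn.Sequential(
            nn.Conv2d(bins, channels, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))
        self.temporal = nn.Sequential(
            nn.Conv1d(1, channels, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1))
        self.head = nn.Linear(2 * channels, num_classes)

    def forward(self, voxels):                            # voxels: (1, T, H, W)
        s = self.spatial(voxels).flatten(1)               # spatial summary
        t_signal = voxels.mean(dim=(2, 3)).unsqueeze(1)   # (1, 1, T) per-bin activity
        t = self.temporal(t_signal).flatten(1)            # temporal summary
        return self.head(torch.cat([s, t], dim=1))        # fuse both branches

# Usage: 5000 synthetic events -> class logits.
events = torch.rand(5000, 4) * torch.tensor([127.0, 127.0, 1.0, 1.0])
logits = DualBranch()(EventVoxelEncoder()(events))
```

Because the encoder consumes a flat event list of any length, the same forward pass handles variable-length streams; the heavy dense computation is confined to the compact voxel grid rather than the raw event count, which is the efficiency argument the abstract makes.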
