Continuous sign language recognition (CSLR) technology enables social communication for the hearing-impaired by converting sign language videos into text. However, owing to the limited receptive fields of convolutional networks and the inefficient long-range dependency modeling of temporal modules, current methods struggle to capture the cross-regional and high-order dynamic semantics of complex gestures. To address these limitations, we propose a dynamic spatiotemporal hypergraph network named HyperSign, which optimizes feature learning through novel graph architectures. For single-frame spatial modeling, we propose a saliency-aware spatial graph construction strategy that dynamically quantifies semantic saliency by integrating per-patch feature complexity and motion intensity. This strategy adaptively adjusts node connectivity based on the computed saliency, enabling the graph structure to focus on information-dense regions such as the hands and face. For temporal dependency modeling, we move beyond conventional pairwise frame interactions and propose a temporal hypergraph construction method. It employs a learnable clustering algorithm to aggregate semantically correlated nodes within temporal windows into hyperedges, thereby explicitly capturing the high-order associations of individual gesture actions that span multiple frames. Extensive experiments on the PHOENIX14, PHOENIX14-T, and CSL-Daily datasets demonstrate that HyperSign outperforms state-of-the-art (SOTA) approaches in CSLR without any additional annotation, establishing a new feature-learning paradigm for the CSLR task.
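To make the saliency-aware spatial graph concrete, the sketch below is a minimal PyTorch toy, not the authors' implementation: it scores each patch with a feature-complexity proxy (channel variance) plus a motion-intensity term (feature change across frames), then reweights a top-k similarity adjacency by joint saliency so that edges concentrate on salient patches. All function names, tensor shapes, and the exact saliency formula are assumptions.

```python
import torch

def build_saliency_graph(feats, prev_feats, k=8):
    """Toy saliency-aware spatial graph for one frame.

    feats:      (N, C) patch features of the current frame (hypothetical shapes).
    prev_feats: (N, C) patch features of the previous frame.
    Returns a dense (N, N) adjacency reweighted by node saliency.
    """
    # Feature complexity: per-patch channel variance as a cheap proxy.
    complexity = feats.var(dim=-1)                      # (N,)
    # Motion intensity: feature change relative to the previous frame.
    motion = (feats - prev_feats).norm(dim=-1)          # (N,)
    # Combined semantic saliency, squashed to (0, 1); the real fusion
    # rule in the paper is unknown, so this is an assumed stand-in.
    saliency = torch.sigmoid(complexity + motion)       # (N,)

    # Adaptive connectivity: each patch keeps edges only to its k most
    # similar patches, and edge weights are scaled by joint saliency so
    # information-dense regions (hands, face) dominate message passing.
    sim = torch.softmax(feats @ feats.t() / feats.size(-1) ** 0.5, dim=-1)
    topk = sim.topk(k, dim=-1)
    adj = torch.zeros_like(sim).scatter_(-1, topk.indices, topk.values)
    adj = adj * saliency.unsqueeze(0) * saliency.unsqueeze(1)
    return adj

# Toy usage with random patch features (7x7 patch grid, 64-d features).
f_t, f_prev = torch.randn(49, 64), torch.randn(49, 64)
print(build_saliency_graph(f_t, f_prev).shape)  # torch.Size([49, 49])
```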
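The temporal hypergraph construction can likewise be sketched as differentiable soft clustering: frames in a window are soft-assigned to K learnable centroids, each centroid acting as a hyperedge, and one node-to-hyperedge-to-node pass propagates context among all frames of a multi-frame gesture unit in a single hop. The module name, shapes, and aggregation details below are illustrative assumptions, not the paper's specification.

```python
import torch
import torch.nn as nn

class TemporalHyperedge(nn.Module):
    """Toy learnable clustering of frames into temporal hyperedges."""

    def __init__(self, dim, num_edges=4):
        super().__init__()
        # Learnable cluster centroids, one per hyperedge.
        self.centroids = nn.Parameter(torch.randn(num_edges, dim))
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (T, dim) node features for one temporal window.
        # Soft incidence matrix H: how strongly each frame belongs to
        # each hyperedge (differentiable, so clusters are learned).
        h = torch.softmax(x @ self.centroids.t(), dim=-1)              # (T, K)
        # Hyperedge features: membership-weighted mean of member nodes.
        edge_feat = (h.t() @ x) / (h.sum(dim=0, keepdim=True).t() + 1e-6)  # (K, dim)
        # Node -> hyperedge -> node message passing captures high-order
        # (many-frame) associations in one step, unlike pairwise edges.
        return x + self.proj(h @ edge_feat)                            # (T, dim)

# Toy usage: a 9-frame window with 64-d per-frame features.
window = torch.randn(9, 64)
print(TemporalHyperedge(64)(window).shape)  # torch.Size([9, 64])
```

Because the frame-to-hyperedge assignments are soft and parameterized, they can be trained end-to-end with the recognition loss, which is what makes the clustering "learnable" rather than a fixed grouping heuristic.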