Effectively capturing multimodal co-occurrence signals, such as hand shapes, facial expressions, and body postures, is critical for semantic understanding in sign language recognition (SLR) and translation (SLT). Although skeleton data offer greater efficiency and robustness than RGB inputs, existing methods typically rely on pairwise graph structures, limiting their ability to model complex high-order interactions across body regions. To address this limitation, we propose HyperSign, a hierarchical hypergraph neural network that systematically captures high-order co-occurrence patterns among diverse body parts. Its Co-occurrence Graph Perception Module jointly learns relational structures via three complementary pathways: (1) traditional graph convolutions for modeling physical joint connections, (2) dynamic geometric hypergraphs constructed via k-nearest neighbors to encode local spatial patterns, and (3) soft hypergraphs generated by learnable prototypes to reveal latent semantic associations. To further enhance structural modeling and semantic consistency, a Meta-Part Hypergraph Fusion Module abstracts feature streams from the hands, face, and body into unified hypergraph nodes, while leveraging empirically derived co-occurrence priors to model high-order cross-part dependencies. Moreover, an uncertainty-aware collaborative distillation mechanism guides the model to focus on critical body regions. Extensive experiments on standard SLR and SLT benchmarks (PHOENIX 2014, PHOENIX 2014T, and CSL Daily) demonstrate that HyperSign not only outperforms existing skeleton-based approaches in both speed and accuracy but also achieves competitive or superior results compared to several state-of-the-art RGB-based methods across multiple evaluation metrics.
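
To make the two hypergraph pathways concrete, the sketch below is a minimal, hypothetical PyTorch illustration (not the authors' implementation) of how a geometric hypergraph built from k-nearest-neighbor joint groups and a soft hypergraph produced by learnable prototypes could be constructed and convolved; all names (knn_incidence, SoftHypergraph, HypergraphConv) and sizes (k, num_prototypes) are assumptions made for the example.

# Hypothetical sketch of the two hypergraph pathways described in the abstract:
# a k-NN geometric hypergraph and a prototype-based soft hypergraph, each
# followed by a plain hypergraph convolution (joints -> hyperedges -> joints).
import torch
import torch.nn as nn
import torch.nn.functional as F

def knn_incidence(joints: torch.Tensor, k: int) -> torch.Tensor:
    """Hard incidence matrix H (V x V): hyperedge j groups joint j and its k nearest neighbors.

    joints: (V, C) joint coordinates (or features) for one frame.
    """
    dist = torch.cdist(joints, joints)                 # (V, V) pairwise distances
    idx = dist.topk(k + 1, largest=False).indices      # each joint plus its k neighbors
    H = torch.zeros(joints.size(0), joints.size(0), device=joints.device)
    H.scatter_(0, idx.t(), 1.0)                        # column j = membership of hyperedge j
    return H

class SoftHypergraph(nn.Module):
    """Soft incidence via learnable prototypes: H[v, e] = softmax_e(<x_v, p_e>)."""
    def __init__(self, in_dim: int, num_prototypes: int):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(num_prototypes, in_dim) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:      # x: (V, C)
        return F.softmax(x @ self.prototypes.t(), dim=-1)    # (V, E) soft memberships

class HypergraphConv(nn.Module):
    """Aggregate joint features into hyperedges, redistribute to joints, then project."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.theta = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor, H: torch.Tensor) -> torch.Tensor:
        De = H.sum(dim=0).clamp(min=1.0)                 # hyperedge degrees
        Dv = H.sum(dim=1).clamp(min=1.0)                 # vertex degrees
        edge_feat = (H.t() @ x) / De.unsqueeze(-1)       # joints -> hyperedges
        node_feat = (H @ edge_feat) / Dv.unsqueeze(-1)   # hyperedges -> joints
        return self.theta(node_feat)

# Usage sketch: combine the two pathways for one frame of V joints.
V, C, k, E = 27, 64, 4, 8
x = torch.randn(V, C)                    # per-joint features
joints_xy = torch.randn(V, 2)            # per-joint 2D coordinates
H_geo = knn_incidence(joints_xy, k)
H_soft = SoftHypergraph(C, E)(x)
out = HypergraphConv(C, 128)(x, H_geo) + HypergraphConv(C, 128)(x, H_soft)

In this toy version, the geometric pathway captures local spatial grouping while the prototype pathway lets joints share hyperedges by feature similarity regardless of distance; the paper additionally fuses these with ordinary graph convolutions over the physical skeleton and with part-level (hands, face, body) hypergraph nodes.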