Abstract: Sign Language Translation has advanced with deep learning, yet evaluations remain signer-dependent, with overlapping signers in training, development, and test sets. This raises concerns about whether models truly generalise or rely on signer-specific features. To address this, signer-fold cross-validation is conducted on GFSLT-VLP, GASLT, and SignCL—three leading, publicly available, non-proprietary gloss-free sign language translation models. Experiments are performed on two benchmark datasets, CSL-Daily and PHOENIX14T. The results reveal a significant performance drop under signer-independent settings. On PHOENIX14T, GFSLT-VLP sees BLEU-4 fall from 21.44 to as low as 3.59 and ROUGE-L from 42.49 to 11.89; GASLT drops from a reported 15.74 to 8.26; and SignCL from 22.74 to 3.66.
Similarly, on CSL-Daily, GASLT’s BLEU-4 drops from 4.07 to an average of 3.63 under signer-fold cross-validation, despite the increased training data. These findings highlight the substantial overestimation of SLT model performance when evaluations are conducted under signer-dependent assumptions. This work proposes three key recommendations: (1) adopting signer-independent evaluation protocols to ensure generalisation to unseen signers, (2) restructuring existing datasets to include explicit signer-independent splits for consistent benchmarking, and (3) encouraging the reporting of both signer-dependent and signer-independent results to improve transparency and comparability.
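The signer-fold protocol described above amounts to grouped cross-validation over signer identities, so that no signer appears in both the training and test partitions of any fold. Below is a minimal illustrative sketch, not the authors' implementation; the function name and the `(signer_id, clip)` sample format are hypothetical, since the actual datasets store signer IDs in their metadata.

```python
from collections import defaultdict

def signer_folds(samples, num_folds=None):
    """Yield (train, test) splits in which no signer appears in both sets.

    `samples` is an iterable of (signer_id, clip) pairs. By default this
    performs leave-one-signer-out cross-validation; passing num_folds
    groups signers round-robin into that many held-out sets instead.
    """
    by_signer = defaultdict(list)
    for signer, clip in samples:
        by_signer[signer].append((signer, clip))
    signers = sorted(by_signer)
    num_folds = num_folds or len(signers)  # leave-one-signer-out by default
    for i in range(num_folds):
        held_out = set(signers[i::num_folds])  # round-robin assignment of signers
        test = [s for sg in signers if sg in held_out for s in by_signer[sg]]
        train = [s for sg in signers if sg not in held_out for s in by_signer[sg]]
        yield train, test
```

Reported signer-independent scores would then be the mean metric (e.g. BLEU-4) across folds, which is how the averaged figures in the abstract are obtained.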