Auto-regressive (AR)-based decoders, owing to their flexibility in handling variable-length outputs and their strong capability in modeling character-level dependencies, have emerged as the predominant decoding paradigm in the field of scene text recognition (STR). However, AR-based decoders suffer from attention drift, slow decoding speed, and difficulty capturing global dependencies, restricting their performance in various scenarios. In this paper, we propose a novel paradigm for AR-based decoding, called One-Token to Sequence (One2Seq), to address the above issues. Unlike existing methods, we encode the semantic features into a single context token and design a One-Token Wise Decoder to perform the decoding, which alleviates the attention drift caused by the accumulation of semantic information. Moreover, we propose Position-aware Hash Embedding to embed the decoded characters, ensuring that order information is preserved in the context token. By continuously updating this token, One2Seq fully leverages the decoded semantic information while avoiding the computational overhead associated with a growing query sequence. Furthermore, to leverage global information for decoding, we propose Dynamic Global Infusion, which dynamically integrates global visual features into the context token. Equipped with the enriched context token, the model has an enhanced ability to extract discriminative local features under the guidance of global context, thereby improving recognition accuracy. Extensive experiments reveal that, with its ingenious design, One2Seq exhibits marked superiority in both accuracy and decoding speed compared to existing STR models.
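The abstract's decoding loop can be pictured as follows. This is a heavily simplified toy sketch, not the paper's implementation: all dimensions, the sinusoidal position code, the sigmoid gate, and the parameter matrices are illustrative assumptions standing in for learned components. It only shows the structural idea that a single context token is updated per step (rather than a growing query sequence), with a gated global-feature infusion and a position-aware character embedding folded back into the token.

```python
import numpy as np

rng = np.random.default_rng(0)
D, T_max, vocab = 16, 8, 5  # hypothetical feature dim, max length, toy vocab
EOS = 0

# Toy "visual features": local patch features and a pooled global feature.
local_feats = rng.normal(size=(10, D))
global_feat = local_feats.mean(axis=0)

# Random matrices standing in for learned weights (illustrative only).
W_cls = rng.normal(size=(D, vocab))
char_emb = rng.normal(size=(vocab, D))

def pos_hash_embed(char_id, step):
    # Position-aware embedding (sketch): mix the character embedding with a
    # deterministic position code so character order survives in the token.
    pos_code = np.sin(np.arange(D) * (step + 1) / D)
    return char_emb[char_id] + pos_code

def decode(context):
    out = []
    for t in range(T_max):
        # Dynamic global infusion (sketch): gate global visual info into the token.
        gate = 1.0 / (1.0 + np.exp(-context @ global_feat / D))
        context = context + gate * global_feat
        # One-token attention over local features: the single context token
        # queries all local patches, so the query cost stays constant per step.
        attn = np.exp(local_feats @ context / np.sqrt(D))
        attn /= attn.sum()
        glimpse = attn @ local_feats
        # Classify, emit, and fold the decoded character back into the token.
        char_id = int(np.argmax(glimpse @ W_cls))
        if char_id == EOS:
            break
        out.append(char_id)
        context = context + pos_hash_embed(char_id, t)
    return out

print(decode(np.zeros(D)))
```

Note the contrast with a standard AR decoder, where the query sequence grows by one token per decoded character; here only the fixed-size context token is updated, which is the source of the claimed speed advantage.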