Text-to-image person re-identification (TIReID) aims to retrieve the most relevant pedestrian images from an image gallery based on natural language descriptions. Recent studies have achieved significant performance gains by leveraging Masked Language Modeling (MLM) to align fine-grained information through local matching. However, during text feature extraction, randomly masking text tokens may disrupt the semantic relationships between local tokens, leading to feature misalignment; on the image side, redundant patches in pedestrian images hinder cross-modal information interaction. Moreover, noisy image-text pairs further complicate learning, as the model may be misled into fitting incorrect patterns. To address these issues, we propose a robust fine-grained local alignment framework based on Key Phrase Dynamic Mask (KPDM). First, we strengthen the semantic relationships between text tokens with an "adjective + noun" phrase-level masking strategy, mitigating local misalignment. In addition, we integrate cross-layer importance estimation to highlight key pedestrian image representations while removing redundant image features. Building on this, we design a novel frequency-based masked language loss (FMLM) to supervise fine-grained semantic-level local alignment. Second, we propose a trusted consensus partitioning mechanism that uses intra-identity image-text similarity distributions to identify noisy pairs, enhancing model robustness. Extensive experiments show that our method achieves 67.95\% Rank-1 and 51.88\% mAP on the RSTPReid dataset, exceeding the previous state of the art by 2.6\% and 1\%. Furthermore, KPDM achieves Rank-1 accuracies of 75.97\% on CUHK-PEDES and 67.78\% on ICFG-PEDES, outperforming earlier methods.
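The "adjective + noun" phrase-level masking idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the toy adjective/noun lexicons stand in for a real part-of-speech tagger, and `phrase_level_mask` is a hypothetical helper name. The key difference from random token masking is that the attribute and the object it modifies are masked jointly, so their semantic link is never half-destroyed.

```python
import random

# Toy lexicons standing in for a real POS tagger (assumption: in practice
# a proper tagger, e.g. spaCy or NLTK, would identify adjectives/nouns).
ADJECTIVES = {"red", "short", "black", "white", "blue", "long"}
NOUNS = {"jacket", "hair", "shoes", "bag", "shirt", "trousers"}


def phrase_level_mask(text, mask_token="[MASK]", mask_prob=1.0, rng=None):
    """Mask whole 'adjective + noun' phrases instead of independent tokens,
    preserving the semantic relationship between attribute and object."""
    rng = rng or random.Random(0)
    tokens = text.split()
    out = []
    i = 0
    while i < len(tokens):
        word = tokens[i].lower().strip(".,")
        nxt = tokens[i + 1].lower().strip(".,") if i + 1 < len(tokens) else ""
        if word in ADJECTIVES and nxt in NOUNS and rng.random() < mask_prob:
            # Mask the adjective and its noun together as one key phrase.
            out.extend([mask_token, mask_token])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return " ".join(out)
```

For example, `phrase_level_mask("a woman in a red jacket and black shoes")` masks the two attribute phrases jointly, yielding `"a woman in a [MASK] [MASK] and [MASK] [MASK]"`, whereas random token masking could mask only "red" and leave "jacket" stranded.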
