Open-vocabulary multi-object tracking (OV-MOT) aims to track objects from unseen categories beyond the training set. Existing methods rely on pseudo video sequences synthesized from static images and therefore struggle to model realistic motion patterns, resulting in limited association performance in real-world scenarios. To alleviate these issues, we propose SAM2-OV, a novel association-learning-free OV-MOT method that adopts a detection-only tuning paradigm, eliminating the need for synthetic sequences or spatiotemporal supervision and substantially reducing the number of learnable parameters. The core of our method is a Unified Detection Module (UDM), which provides object-level prompts that enable SAM2 to perform OV-MOT. Built on UDM, SAM2-OV is the first method to integrate SAM2 into OV-MOT, fully unleashing its zero-shot cross-frame association ability. To further enhance object association under occlusion and abrupt motion, we introduce a Motion Prior Assistance Module (MPAM) that incorporates motion cues into the mask selection process. In addition, a Semantic Enhancement Adapter (SEA) distilled from CLIP improves classification generalization, and a sparse prompting strategy reduces computational redundancy by triggering detection only on selected keyframes. As only the detection module is tuned on static images, the overall training process remains simple and efficient. Experiments on the TAO dataset demonstrate that SAM2-OV achieves state-of-the-art performance under the TETA metric, particularly on novel categories, and evaluations on the KITTI dataset demonstrate the strong zero-shot cross-domain transferability of SAM2-OV.
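As a rough illustration of how a motion prior can enter mask selection, the minimal sketch below re-ranks an object's candidate mask hypotheses by combining each hypothesis's confidence with its IoU against a constant-velocity box prediction. The helper names (`select_mask`, `predict_box`) and the linear score fusion with weight `alpha` are our illustrative assumptions, not the paper's exact MPAM formulation.

```python
# Minimal sketch of motion-prior-assisted mask selection.
# Assumptions (not from the paper): a constant-velocity motion model and a
# linear fusion of mask confidence with motion-prior IoU.
import numpy as np

def box_iou(a: np.ndarray, b: np.ndarray) -> float:
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-6)

def predict_box(prev_box: np.ndarray, velocity: np.ndarray) -> np.ndarray:
    """Constant-velocity prior: shift the last box by the estimated (dx, dy)."""
    return prev_box + np.concatenate([velocity, velocity])

def select_mask(candidates, prev_box, velocity, alpha=0.5):
    """Re-rank candidate masks for one tracked object with a motion prior.

    candidates: list of (mask_confidence, box) pairs, e.g. the multi-mask
                hypotheses a promptable segmenter emits for one object.
    Returns the index of the best candidate under the combined score.
    """
    predicted = predict_box(prev_box, velocity)
    scores = [
        alpha * conf + (1.0 - alpha) * box_iou(box, predicted)
        for conf, box in candidates
    ]
    return int(np.argmax(scores))
```

Under occlusion or abrupt motion, the IoU term down-weights confident but spatially implausible hypotheses, which is the failure mode the abstract says MPAM is designed to address.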
