Most existing multi-modal trackers adopt uniform fusion strategies and propagate temporal information through mixed tokens, failing to account for modality-specific differences and producing entangled temporal representations. To address these limitations, we propose MDTrack, a multi-modal object tracker with Modality-aware fusion and Decoupled temporal propagation. Specifically, for modality-aware fusion, we allocate a dedicated expert to each modality (Infrared, Event, Depth, and RGB) to process its representation. The gating mechanism within the mixture of experts (MoE) then dynamically selects the optimal experts based on the input features, enabling adaptive, modality-specific fusion. For decoupled temporal propagation, we introduce two separate State Space Model (SSM) structures that independently store and update the hidden states $h$ of the RGB and X-modal streams, effectively capturing their distinct temporal information. To ensure synergy between the two temporal representations, we apply cross-attention between the input features of the two SSMs, facilitating implicit information exchange. The resulting temporally enriched features are then integrated into the backbone via cross-attention, enhancing MDTrack's ability to leverage temporal information. Extensive experiments demonstrate the effectiveness of MDTrack: both MDTrack-S (Modality-Specific Training) and MDTrack-U (Unified-Modality Training) achieve state-of-the-art performance on five multi-modal tracking benchmarks.
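The modality-aware fusion described above rests on a standard mixture-of-experts gate: per-token scores select a few experts whose outputs are combined with softmax weights. The sketch below is a minimal toy illustration of that gating pattern, not the authors' implementation; the dimensions, the linear experts, and names such as `moe_fuse`, `n_experts`, and `top_k` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8            # token feature dimension (toy value, not from the paper)
n_experts = 4    # one expert per modality: Infrared, Event, Depth, RGB
top_k = 2        # number of experts the gate activates per token

# Each expert is a simple linear map; the gate scores all experts per token.
experts = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(n_experts)]
W_gate = rng.normal(size=(d, n_experts)) / np.sqrt(d)

def moe_fuse(x):
    """x: (n_tokens, d) input features -> (n_tokens, d) fused output."""
    logits = x @ W_gate                           # (n_tokens, n_experts)
    idx = np.argsort(logits, axis=1)[:, -top_k:]  # top-k experts per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        sel = logits[t, idx[t]]
        w = np.exp(sel - sel.max())               # softmax over selected
        w /= w.sum()                              # experts only
        for weight, e in zip(w, idx[t]):
            out[t] += weight * (x[t] @ experts[e])
    return out

tokens = rng.normal(size=(5, d))
fused = moe_fuse(tokens)
```

Because only the top-k gate weights are renormalized, each token's output is a convex combination of a small subset of expert outputs, which is what lets the gate route features from different modalities to different experts.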
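The decoupled temporal propagation can likewise be reduced to a toy recurrence: two SSMs, each keeping its own hidden state $h$, with an exchange between the streams' inputs before each update. The sketch below uses diagonal state matrices and collapses the paper's cross-attention to a single-query gated mix; all parameter values and the names `cross_mix` and `propagate` are illustrative assumptions, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 6  # feature dimension (toy value)

# Separate diagonal SSM parameters for the RGB and X-modal streams,
# so each stream stores and updates its own hidden state h.
A_rgb, A_x = 0.9 * np.ones(d), 0.8 * np.ones(d)   # state decay per stream
B_rgb, B_x = 0.5 * np.ones(d), 0.5 * np.ones(d)   # input gain per stream

def cross_mix(q, kv):
    """One-token stand-in for cross-attention: the query feature
    attends to the other stream's feature via a sigmoid-gated sum."""
    score = float(q @ kv) / np.sqrt(d)
    gate = 1.0 / (1.0 + np.exp(-score))
    return q + gate * kv

def propagate(frames_rgb, frames_x):
    """Run both SSM recurrences over a frame sequence; the hidden
    states stay decoupled while the inputs exchange information."""
    h_rgb = np.zeros(d)
    h_x = np.zeros(d)
    states = []
    for x_rgb, x_x in zip(frames_rgb, frames_x):
        u_rgb = cross_mix(x_rgb, x_x)           # implicit RGB <- X exchange
        u_x = cross_mix(x_x, x_rgb)             # implicit X <- RGB exchange
        h_rgb = A_rgb * h_rgb + B_rgb * u_rgb   # decoupled state updates
        h_x = A_x * h_x + B_x * u_x
        states.append((h_rgb.copy(), h_x.copy()))
    return states

T = 4
states = propagate(rng.normal(size=(T, d)), rng.normal(size=(T, d)))
```

Keeping `h_rgb` and `h_x` as separate recurrences is the point of the decoupling: neither stream's temporal state is overwritten by the other, while the input-side mixing still lets the two representations stay in sync.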