Multi-view 3D object detection has garnered increasing attention, particularly due to its success in autonomous driving systems. Although multi-view systems capture rich semantic information, their spatial-geometric reasoning capabilities remain limited. Recent studies employ simulated point cloud generation to facilitate LiDAR-camera multi-modal knowledge distillation and achieve structural consistency. However, these methods still suffer from two major drawbacks: (i) alignment challenges arising from significant discrepancies between the LiDAR and camera modalities, and (ii) prediction errors in the simulated point cloud that may degrade the extracted image semantics during fusion. Accordingly, we propose adaptive-smooth distillation, which adjusts the granularity of alignment according to the feature discrepancy for LiDAR-camera knowledge distillation. Specifically, this work considers both LiDAR-to-camera cross-modal distillation and multi-modal distillation from LiDAR-camera fusion to simulated point cloud-camera fusion. We then introduce a heterogeneous fusion module that strategically biases the fusion process toward the extracted camera features, thereby enhancing the robustness of the fused feature. Additionally, we propose soft-weighted response distillation, which enables the student model to selectively mimic the high-quality outputs of the teacher model. Extensive experiments demonstrate the superiority of our method, which achieves statistically significant improvements of 4.9% mAP and 4.5% NDS.
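
The abstract does not spell out the adaptive-smooth formulation, but the idea of tuning alignment granularity to the feature discrepancy can be sketched as a smooth-L1-style distillation loss whose transition point adapts per sample. Everything below, including the function name, the per-sample discrepancy statistic, and the clamping range, is an illustrative assumption rather than the authors' exact loss.

```python
import torch

def adaptive_smooth_distill_loss(student_feat, teacher_feat, base_beta=1.0, eps=1e-6):
    """Hypothetical sketch of an adaptive-smooth distillation loss.

    The transition point `beta` of a smooth-L1-style loss is scaled per
    sample by the normalized feature discrepancy, so well-aligned samples
    are matched finely (quadratic regime) while samples with large
    LiDAR-camera discrepancy are penalized more gently (linear regime).
    Assumes BEV feature maps of shape (B, C, H, W).
    """
    diff = (student_feat - teacher_feat).abs()
    # Per-sample discrepancy statistic used to adapt the granularity.
    disc = diff.mean(dim=(1, 2, 3), keepdim=True)
    beta = base_beta * (disc / (disc.mean() + eps)).clamp(0.5, 2.0)
    # Smooth-L1 with a sample-wise adaptive transition point.
    quad = 0.5 * diff.pow(2) / beta
    lin = diff - 0.5 * beta
    loss = torch.where(diff < beta, quad, lin)
    return loss.mean()
```

Under this reading, a large discrepancy relaxes the alignment (a wider linear regime), which is one way to avoid forcing the student to match teacher features that the modality gap makes unreachable.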
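Similarly, the heterogeneous fusion module could be read as a gated fusion whose gate is biased toward the camera branch. The PyTorch sketch below assumes a sigmoid gate with a positive bias initialization so that fusion initially leans on camera features and is less exposed to simulated point cloud prediction errors; the module name, shapes, and bias value are hypothetical.

```python
import torch
import torch.nn as nn

class HeterogeneousFusion(nn.Module):
    """Sketch of a fusion module biased toward camera features.

    A learned gate mixes camera and simulated-point-cloud features;
    initializing the gate bias to a positive constant makes the sigmoid
    start near 1, so the fused output initially follows the camera
    branch. The bias value and architecture are assumptions.
    """
    def __init__(self, channels, camera_bias=2.0):
        super().__init__()
        self.gate = nn.Conv2d(2 * channels, channels, kernel_size=1)
        nn.init.constant_(self.gate.bias, camera_bias)  # favor camera at init
        self.out = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, camera_feat, sim_point_feat):
        g = torch.sigmoid(self.gate(torch.cat([camera_feat, sim_point_feat], dim=1)))
        fused = g * camera_feat + (1.0 - g) * sim_point_feat
        return self.out(fused)
```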
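Finally, soft-weighted response distillation suggests re-weighting each teacher prediction by some quality measure before the student imitates it. The sketch below uses teacher confidence (the max softmax probability) as that measure, which is an assumption; the paper may instead use a different quality signal, e.g., IoU with ground truth.

```python
import torch
import torch.nn.functional as F

def soft_weighted_response_distill(student_logits, teacher_logits, temperature=2.0):
    """Sketch of soft-weighted response distillation.

    Each teacher prediction's KD term is re-weighted by the teacher's
    own confidence, so the student selectively mimics high-quality
    teacher outputs and down-weights noisy ones. The confidence-based
    weight is an assumption, not the authors' published scheme.
    Assumes logits of shape (N, num_classes).
    """
    t_prob = F.softmax(teacher_logits / temperature, dim=-1)
    s_logp = F.log_softmax(student_logits / temperature, dim=-1)
    # Per-prediction KL divergence between teacher and student.
    kl = (t_prob * (t_prob.clamp_min(1e-8).log() - s_logp)).sum(dim=-1)
    # Soft weight: teacher confidence, normalized over predictions.
    w = t_prob.max(dim=-1).values
    w = w / (w.sum() + 1e-8)
    return (w * kl).sum() * temperature ** 2
```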
