Open-Vocabulary Object Detection (OVOD) aims to detect both known and novel categories in complex visual scenes, surpassing the limitations of conventional closed-set detectors. Recent advances in vision-language models (VLMs) have enabled zero-shot recognition by aligning visual features with large-scale textual embeddings. However, current OVOD approaches frequently overlook contextual background cues that are critical for discovering a broader range of novel objects. To address this, we propose BFDet, a scene-to-object reasoning framework that leverages the complementary capabilities of Large Language Models (LLMs) and VLMs. BFDet introduces a novel knowledge discovery mechanism that models the interaction between foreground objects and background context: high-confidence detections are first used to infer the background scene, which in turn guides an LLM to generate context-aware novel object candidates. After verification through cross-modal alignment, these candidates serve as reliable pseudo-labels for supervising detector training. Designed as a plug-and-play module, BFDet integrates seamlessly into existing detection pipelines and consistently improves performance on novel categories across the COCO and LVIS benchmarks, yielding a +3.1 AP gain on novel categories while maintaining strong performance on known ones.
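The scene-to-object reasoning loop described above can be sketched in pseudocode-style Python. This is a minimal illustration only: the scene lookup, the LLM prompt stand-in, the embedding function, and the similarity threshold are all hypothetical assumptions, not the authors' actual implementation.

```python
# Illustrative sketch of a scene-to-object reasoning loop like BFDet's.
# All names, mappings, and thresholds below are assumptions for exposition.

def infer_scene(high_conf_labels):
    """Map high-confidence foreground detections to a coarse background scene."""
    scene_map = {  # toy knowledge table; a real system would learn or query this
        frozenset({"surfboard", "person"}): "beach",
        frozenset({"oven", "sink"}): "kitchen",
    }
    detected = set(high_conf_labels)
    for objects, scene in scene_map.items():
        if objects <= detected:
            return scene
    return "unknown"

def llm_propose_candidates(scene):
    """Stand-in for an LLM queried with a scene-conditioned prompt,
    e.g. 'List objects commonly found at a {scene}.'"""
    knowledge = {
        "beach": ["umbrella", "kite", "surfboard"],
        "kitchen": ["toaster", "microwave", "kettle"],
    }
    return knowledge.get(scene, [])

def verify_cross_modal(candidates, region_embedding, text_embed, threshold=0.5):
    """Keep candidates whose text embedding aligns with the region feature,
    using cosine similarity against a confidence threshold."""
    verified = []
    for cand in candidates:
        t = text_embed(cand)
        dot = sum(a * b for a, b in zip(region_embedding, t))
        norm = (sum(a * a for a in region_embedding) ** 0.5) * \
               (sum(b * b for b in t) ** 0.5)
        sim = dot / norm if norm else 0.0
        if sim >= threshold:
            verified.append((cand, sim))  # surviving pairs become pseudo-labels
    return verified

# Example: detections imply a beach scene, which conditions candidate generation.
detections = ["person", "surfboard"]
scene = infer_scene(detections)          # "beach"
candidates = llm_propose_candidates(scene)
```

Verified candidates would then be attached to image regions as pseudo-labels to supervise detector training on novel categories.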