As interest grows in generating long, detailed image captions, existing automatic evaluation metrics are increasingly strained. N-gram-based metrics, though efficient, fail to capture semantic correctness, especially for longer outputs. Representational Similarity (RS) metrics, designed to address this, initially saw limited use due to high computational costs; today, despite advances in hardware, they remain unpopular because they fall short of even weak baselines such as BLEU. Meanwhile, metrics based on large language models (LLMs) correlate strongly with human judgments but remain too expensive for routine use in model development. We introduce SPECS (Specificity-Enhanced CLIP-Score), a reference-free RS metric tailored to long image captioning. SPECS modifies CLIP with a new objective that emphasizes specificity: rewarding correct details and penalizing errors. We show that SPECS matches the performance of leading LLM-based metrics in correlating with human judgments while being far more efficient, making it a practical alternative for iterative checkpoint evaluation during image captioning model development.
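For context, SPECS sits in the family of reference-free CLIP-based scores. Below is a minimal sketch of the vanilla CLIPScore formulation (w * max(cos(image, caption), 0), with w = 2.5) that such metrics build on, written against the Hugging Face transformers API. The model checkpoint and file name are illustrative only, and SPECS's specificity-trained objective is not reproduced here; the abstract does not specify it.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative checkpoint; SPECS fine-tunes CLIP with its own
# specificity objective, which this sketch does not implement.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clipscore(image: Image.Image, caption: str, w: float = 2.5) -> float:
    """Reference-free CLIPScore: w * max(cos(image, caption), 0)."""
    inputs = processor(
        text=[caption], images=image, return_tensors="pt",
        padding=True,
        truncation=True,  # vanilla CLIP truncates text at 77 tokens,
                          # one reason long captions strain such metrics
    )
    with torch.no_grad():
        out = model(**inputs)
    # L2-normalize both embeddings so the dot product is cosine similarity
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return w * max(float(img @ txt.T), 0.0)

# Hypothetical usage: score one image against a generated caption
score = clipscore(Image.open("example.jpg"), "A long, detailed caption ...")

Because the score needs only the candidate caption and the image, it can be run over every checkpoint during training without collecting reference captions, which is the efficiency argument the abstract makes for SPECS over LLM-based judges.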