Large language models (LLMs) are increasingly integral as productivity assistants, but existing benchmarks fall short in rigorously evaluating their real-world instruction-following capabilities. Current benchmarks often (i) lack sufficient multilinguality, (ii) fail to capture the implicit constraints inherent in user requests, and (iii) overlook the complexities of multi-turn dialogue. To address these critical gaps and provide a more realistic assessment, we introduce ProductivityBench, a novel benchmark specifically designed for LLM-based productivity assistants. ProductivityBench distinguishes itself by featuring input prompts across 12 languages, incorporating intra-instance multilingual instructions, employing rigorous evaluation criteria that capture both explicit and implicit constraints, and including complex multi-turn dialogue scenarios with both accumulating constraints and context switches. Furthermore, to ensure reliable evaluation, we refined the constraints using an LLM validator. Extensive experiments demonstrate that ProductivityBench presents significantly greater challenges than existing benchmarks; for instance, a strong model like GPT-o1 achieved only a 69.07% overall pass rate. ProductivityBench offers a demanding and realistic assessment of LLMs in practical productivity settings, highlighting their capabilities and limitations.
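To make the evaluation setup concrete, below is a minimal sketch of how constraint-based, multi-turn scoring of this kind could be organized. It assumes a hypothetical instance format; the class names (Constraint, DialogueTurn) and functions (turn_passes, overall_pass_rate) are illustrative and are not the actual ProductivityBench schema, which the abstract does not specify.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Constraint:
    description: str                 # natural-language constraint, e.g. "answer in formal Korean"
    explicit: bool                   # stated outright in the prompt vs. implied by the request
    check: Callable[[str], bool]     # hypothetical judge applied to the model response

@dataclass
class DialogueTurn:
    prompt: str                      # user request; prompts may mix languages within one instance
    constraints: List[Constraint]    # constraints accumulated up to and including this turn

def turn_passes(response: str, turn: DialogueTurn) -> bool:
    """A turn passes only if every accumulated (explicit and implicit) constraint is satisfied."""
    return all(c.check(response) for c in turn.constraints)

def overall_pass_rate(outcomes: List[bool]) -> float:
    """Percentage of turns judged as passing, analogous to the reported overall pass rate."""
    return 100.0 * sum(outcomes) / len(outcomes) if outcomes else 0.0
```

In a setup like this, accumulating constraints across turns means a later turn must still satisfy earlier instructions, while a context switch would replace or reset the active constraint set; the strict all() check is one simple way such per-turn judgments could be aggregated into an overall pass rate.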