While large language models (LLMs) have demonstrated strong capabilities in code generation, current benchmarks primarily focus on single-turn scenarios, neglecting the complexity of multi-turn interactions and user diversity. To address this gap, we introduce Talk2Code, the first benchmark designed for user-stratified multi-turn dialogue code generation evaluation across algorithmic problem-solving and backend programming tasks. A distinctive feature of our benchmark is its user-stratified interaction modeling. For identical coding tasks, we construct separate dialogue trajectories tailored for novice, intermediate, and expert users, capturing their distinct expectations and communication patterns. To facilitate comprehensive evaluation, we propose a multi-dimensional evaluation framework assessing both code quality and interaction experience through a novel Dual-track Evaluation Method. In the Direct Generation Track, the benchmark provides golden dialogue context (excluding the final code) directly to the LLM for code generation. In contrast, the Interactive Dialogue Track simulates realistic multi-turn interactions, prompting the model to proactively clarify instructions and gather requirements before generating solutions. Code quality is evaluated in both tracks by Test Pass Rate and Acceptance Rate, while interaction experience is assessed exclusively within the Interactive Dialogue Track through subjective and alignment indicators. Our benchmark and multi-dimensional indicator system collectively establish a new paradigm for evaluating adaptive, user-aware AI coding assistants.