Recent advances in large language models (LLMs) have shifted focus toward scaling inference-time compute, improving performance without retraining the model. A common approach is to sample multiple outputs in parallel and select one of them as the final output, which has been shown to boost output quality in multiple settings for English. However, it remains unclear how best to apply these methods across diverse languages and tasks. In this work, we study how to robustly scale inference-time compute for open-ended generative tasks in a multilingual, multi-task setting. Our findings show that both the sampling strategy, based on temperature variation, and the selection strategy must be adapted to account for language-specific characteristics. We evaluate existing and novel selection methods, revealing that strategies effective in English often fail to generalize across languages. Our results underscore the need for language- and task-aware approaches to inference-time compute, aiming to democratize performance improvements in underrepresented languages.
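The parallel-sampling-and-selection setup described above (often called best-of-n) can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the work's actual method: `generate` is a hypothetical stand-in for an LLM call, and `score` is a placeholder for whatever selector is used (e.g. a reward model, self-consistency voting, or an LLM judge).

```python
import random

def generate(prompt, temperature, rng):
    # Hypothetical stand-in for an LLM sampling call. A real system would
    # decode from a model at the given temperature; here we just simulate
    # that higher temperature yields noisier candidates.
    noise = rng.random() * temperature
    return (f"{prompt} :: candidate(t={temperature:.1f})", noise)

def score(candidate):
    # Placeholder selection heuristic: prefer lower-noise candidates.
    # Real selectors (reward models, voting, LLM-as-a-judge) go here,
    # and per the abstract may need to be language- and task-aware.
    _text, noise = candidate
    return -noise

def best_of_n(prompt, temperatures, rng):
    # Sample one candidate per temperature in the schedule, then keep
    # the highest-scoring one as the final output.
    candidates = [generate(prompt, t, rng) for t in temperatures]
    return max(candidates, key=score)[0]

rng = random.Random(0)
output = best_of_n("Translate to Swahili: hello", [0.3, 0.7, 1.0], rng)
print(output)
```

In this sketch, the temperature schedule `[0.3, 0.7, 1.0]` and the scoring rule are illustrative knobs; the abstract's point is that both need tuning per language rather than reusing English defaults.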