Since next-scale prediction was introduced as a new paradigm for autoregressive image generation, it has attracted extensive research interest. By progressively increasing resolution in a draft-to-refinement process, next-scale prediction demonstrates great potential in both generation quality and efficiency. However, at high resolutions this paradigm faces a fundamental challenge: token sequences grow quadratically and accumulate across multiple scales, creating a key performance bottleneck. Our systematic study uncovers two critical observations: (1) most image regions stabilize during the early drafting stages, making later refinement over the full-scale image token-inefficient; (2) different scales inherently trade off efficiency against fidelity, suggesting that adaptively dispatching tokens across scales can focus computation where it yields the greatest quality gains. Motivated by these insights, we propose a training-free \textbf{M}ixture \textbf{o}f \textbf{S}cale\textbf{s} (\textbf{MoSs}) method for efficient high-resolution autoregressive image generation. MoSs breaks the strict causal dependency across scales in the final refinement steps by parallelizing scales of different resolutions, each responsible for a subset of spatial regions. A lightweight frequency-based token dispatcher analyzes the drafted image and assigns each region to the appropriate scale. The outputs are then composited over the draft to produce the final high-resolution image. This scale-mixture method delivers remarkable efficiency with little impact on generation quality across various models. For instance, our implementation achieves a \textbf{2.05-4.96$\times$ speedup} on a transformer backbone and up to an \textbf{85.62\% KV cache reduction}, while incurring only a \textbf{0.1-2.4\%} loss in GenEval\citep{ghosh2023geneval} quality, based on the state-of-the-art Infinity\citep{han2024infinity} model.
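To make the idea of a frequency-based token dispatcher concrete, the following is a minimal illustrative sketch (not the paper's actual implementation): it tiles a grayscale draft image into patches, scores each patch by its share of high-frequency energy via a 2-D FFT, and assigns it to one of three hypothetical refinement scales. The function name, patch size, and thresholds are all assumptions made for illustration.

```python
import numpy as np

def dispatch_regions(draft, patch=16, thresholds=(0.05, 0.15)):
    """Toy frequency-based token dispatcher (illustrative only).

    Splits a grayscale draft image into patches, scores each patch by
    its high-frequency energy ratio, and assigns it to a scale:
    0 = keep draft as-is, 1 = mid-resolution refine, 2 = full-resolution
    refine. Thresholds here are arbitrary placeholders, not tuned values.
    """
    h, w = draft.shape
    gh, gw = h // patch, w // patch
    assignment = np.zeros((gh, gw), dtype=int)
    for i in range(gh):
        for j in range(gw):
            tile = draft[i * patch:(i + 1) * patch, j * patch:(j + 1) * patch]
            # Centered magnitude spectrum of the patch.
            spec = np.abs(np.fft.fftshift(np.fft.fft2(tile)))
            total = spec.sum() + 1e-8
            c, r = patch // 2, patch // 4
            low = spec[c - r:c + r, c - r:c + r].sum()  # low-frequency band
            hf_ratio = 1.0 - low / total                # high-frequency share
            # Smooth (stabilized) regions stay at the draft scale;
            # detail-rich regions are routed to finer scales.
            assignment[i, j] = int(hf_ratio > thresholds[0]) + \
                               int(hf_ratio > thresholds[1])
    return assignment
```

Under this sketch, a flat region yields an all-zero assignment (left untouched over the draft), while a texture-heavy region is routed to the finest scale, mirroring the paper's claim that refinement effort should concentrate where it yields the greatest quality gains.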