Layout-to-Image generation has significantly advanced content creation by enabling the rendering of visual text under predefined spatial layouts. Current approaches achieve training-free layout guidance by constructing attention-based energy functions from which correction gradients are derived. In this paper, we demonstrate that vanilla energy functions suffer from two limitations, which result in imprecise layout control and visually unrealistic artifacts. First, the normalizing factor of the Boltzmann distribution defined by the energy function is non-negligible when calculating correction gradients, yet current energy functions cannot compute this factor exactly. Second, although attention varies over time during the denoising process, existing approaches employ a single fixed formulation for all timesteps. To address these challenges, we introduce FreLay, a novel training-free approach equipped with a frequency-aware energy function. Our method first reformulates the energy function to handle the normalizing factor, enabling accurate computation of correction gradients. In addition, leveraging the prior knowledge that low-frequency information deteriorates more slowly during noise addition, we design a time-specific energy function for each timestep from a frequency-domain perspective. Experimental results demonstrate that FreLay consistently outperforms existing state-of-the-art training-free methods by a large margin, both qualitatively and quantitatively, across multiple datasets. Code will be released upon acceptance.
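The frequency prior mentioned above (low-frequency information deteriorates more slowly under noise addition) can be illustrated with a small numerical sketch. This is not FreLay's energy function. It only assumes a synthetic 1D signal with a 1/f amplitude spectrum as a stand-in for image content, a standard diffusion forward step with a hypothetical `alpha_bar` value, and an arbitrary band `cutoff`. The function name and all parameters are illustrative choices.

```python
import numpy as np


def band_snr_after_noise(n=4096, alpha_bar=0.5, cutoff=0.05, seed=0):
    """Return (low-band SNR, high-band SNR) after one forward-diffusion step.

    Assumptions (not from the paper): a synthetic 1/f signal, white Gaussian
    noise, 0 < alpha_bar < 1, and a band split at `cutoff` cycles per sample.
    """
    rng = np.random.default_rng(seed)
    freqs = np.fft.rfftfreq(n)  # 0 .. 0.5 cycles per sample

    # Clean signal: 1/f amplitude spectrum with random phase, zero DC term.
    amp = np.zeros_like(freqs)
    amp[1:] = 1.0 / freqs[1:]
    phase = rng.uniform(0.0, 2.0 * np.pi, size=freqs.shape)
    x0 = np.fft.irfft(amp * np.exp(1j * phase), n=n)
    x0 /= x0.std()  # unit variance, like normalized image data

    # Forward step: x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps
    eps = rng.standard_normal(n)
    signal_part = np.sqrt(alpha_bar) * x0
    noise_part = np.sqrt(1.0 - alpha_bar) * eps

    # Power spectra of the surviving signal and of the injected noise.
    signal_power = np.abs(np.fft.rfft(signal_part)) ** 2
    noise_power = np.abs(np.fft.rfft(noise_part)) ** 2

    low = (freqs > 0) & (freqs <= cutoff)
    high = freqs > cutoff
    snr_low = signal_power[low].sum() / noise_power[low].sum()
    snr_high = signal_power[high].sum() / noise_power[high].sum()
    return snr_low, snr_high


if __name__ == "__main__":
    for a in (0.9, 0.5, 0.1):
        lo, hi = band_snr_after_noise(alpha_bar=a)
        print(f"alpha_bar={a}: low-band SNR={lo:.2f}, high-band SNR={hi:.4f}")
```

Under these assumptions, the low band keeps a higher signal-to-noise ratio than the high band at every noise level. This is the kind of behavior that motivates treating frequency bands differently at different timesteps.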
