RichControl: Structure- and Appearance-Rich Training-Free Spatial Control for Text-to-Image Generation

Peking University

*Equal Contribution

TL;DR: We propose a training-free framework that enables high-quality text-to-image spatial control under arbitrary conditions. By introducing rich structure and appearance control, our method effectively addresses key limitations of prior work, including structure misalignment, condition leakage, and artifacts.

Abstract

Text-to-image (T2I) diffusion models have shown remarkable success in generating high-quality images from text prompts. Recent efforts extend these models to incorporate conditional images (e.g., canny edges) for fine-grained spatial control. Among them, feature injection methods have emerged as a training-free alternative to traditional fine-tuning-based approaches. However, they often suffer from structural misalignment, condition leakage, and visual artifacts, especially when the condition image diverges significantly from natural RGB distributions. Through an empirical analysis of existing methods, we identify a key limitation: the sampling schedule of condition features, previously unexplored, fails to account for the evolving interplay between structure preservation and domain alignment throughout diffusion steps. Inspired by this observation, we propose a flexible training-free framework that decouples the sampling schedule of condition features from the denoising process, and systematically investigate the spectrum of feature injection schedules for higher-quality structure guidance in the feature space. Specifically, we find that condition features sampled from a single timestep are sufficient, yielding a simple yet efficient schedule that balances structure alignment and appearance quality. We further enhance the sampling process by introducing a restart refinement schedule, and improve the visual quality with an appearance-rich prompting strategy. Together, these designs enable training-free generation that is both structure-rich and appearance-rich. Extensive experiments show that our approach achieves state-of-the-art results across diverse zero-shot conditioning scenarios.
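The core idea of decoupling the condition-feature schedule from the denoising schedule can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: every function and constant (e.g., `extract_features`, `t_cond=601`) is a hypothetical stand-in for the corresponding diffusion-model operation.

```python
# Sketch of a single-timestep condition-feature schedule: features are
# extracted from the condition image ONCE, at a fixed timestep t_cond,
# and then injected at every denoising step, instead of re-sampling
# condition features at each step's own timestep.

def extract_features(image, t):
    """Stand-in for noising the condition image to timestep t, running it
    through the diffusion backbone, and caching intermediate features."""
    return {"t": t, "source": image}

def denoise_step(latent, t, cond_feats):
    """Stand-in for one denoising step with feature injection; here we just
    record which condition timestep was injected at which denoising step."""
    return latent + [f"step@{t}<-cond@{cond_feats['t']}"]

def sample_with_injection(cond_image, num_steps=4, t_cond=601):
    # Decoupled schedule: cond_feats depend only on t_cond, never on the
    # current denoising timestep t.
    cond_feats = extract_features(cond_image, t_cond)
    latent = []
    for t in reversed(range(num_steps)):
        latent = denoise_step(latent, t, cond_feats)
    return latent

trace = sample_with_injection("canny_edge.png")
```

The trace shows the same single-timestep features being reused across the whole denoising trajectory, which is the schedule the abstract describes as both simple and sufficient.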


Pipeline

(Pipeline overview figure.)

Given a condition image and a text prompt, our method generates an output image that semantically aligns with the prompt while preserving the structure of the condition image. Our framework consists of three key components. (i) The Structure-Rich Injection (SRI) module selects the structure-rich features of the condition image and injects them into the output image feature space to enable spatial control. (ii) The Restart Refinement (RR) module refines the overall quality of the generated image by iteratively adding noise to and denoising the output. This process enhances the model's predicted clean images and leads to more realistic visual details, such as the eyes of the bear in the figure. (iii) The Appearance-Rich Prompting (ARP) module enriches the original plain prompt based on the semantics of the condition image, producing an appearance-rich prompt and thereby a more detailed and semantically aligned appearance image.
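The restart refinement loop can be sketched with toy numeric "latents". This is an illustrative assumption-laden sketch, not the released code: the noise scale, the pull-back factor `0.9`, and `t_restart=300` are made-up constants, and `denoise` stands in for the actual reverse diffusion process.

```python
# Sketch of Restart Refinement: take the current output, re-noise it to an
# intermediate timestep, denoise it again, and repeat a few times so the
# sample moves closer to the model's predicted clean image.
import random

def add_noise(latent, t, rng):
    # Toy forward process: perturbation grows with the restart timestep t.
    return [x + rng.gauss(0.0, t / 1000.0) for x in latent]

def denoise(latent, target):
    # Toy reverse process: pull the noisy latent back toward the model's
    # clean-image prediction (here, a fixed target vector).
    return [x + 0.9 * (y - x) for x, y in zip(latent, target)]

def restart_refinement(latent, target, t_restart=300, num_restarts=2, seed=0):
    rng = random.Random(seed)
    for _ in range(num_restarts):
        noisy = add_noise(latent, t_restart, rng)
        latent = denoise(noisy, target)
    return latent

clean = [1.0, -0.5, 0.25]   # model's predicted clean image (toy)
rough = [1.3, -0.1, 0.0]    # imperfect first-pass sample (toy)
refined = restart_refinement(rough, clean)
```

In this toy setting, each noise-then-denoise cycle shrinks the gap between the sample and the clean prediction, mirroring how the RR module polishes details in the real pipeline.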

BibTeX

@misc{zhang2025richcontrolstructureappearancerichtrainingfree,
      title={RichControl: Structure- and Appearance-Rich Training-Free Spatial Control for Text-to-Image Generation}, 
      author={Liheng Zhang and Lexi Pang and Hang Ye and Xiaoxuan Ma and Yizhou Wang},
      year={2025},
      eprint={2507.02792},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2507.02792}, 
}