Text-to-image (T2I) diffusion models have shown remarkable success in generating high-quality images from text prompts. Recent efforts extend these models to incorporate conditional images (e.g., Canny edge maps) for fine-grained spatial control. Among them, feature injection methods have emerged as a training-free alternative to traditional fine-tuning-based approaches. However, they often suffer from structural misalignment, condition leakage, and visual artifacts, especially when the condition image diverges significantly from natural RGB distributions. Through an empirical analysis of existing methods, we identify a key limitation: the sampling schedule of condition features, previously unexplored, fails to account for the evolving interplay between structure preservation and domain alignment throughout the diffusion steps. Motivated by this observation, we propose a flexible training-free framework that decouples the sampling schedule of condition features from the denoising process, and we systematically investigate the spectrum of feature injection schedules for higher-quality structure guidance in the feature space. Specifically, we find that condition features sampled at a single timestep are sufficient, yielding a simple yet efficient schedule that balances structure alignment and appearance quality. We further enhance the sampling process by introducing a restart refinement schedule, and we improve visual quality with an appearance-rich prompting strategy. Together, these designs enable training-free generation that is both structure-rich and appearance-rich. Extensive experiments show that our approach achieves state-of-the-art results across diverse zero-shot conditioning scenarios.
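To make the decoupled schedule concrete, the sketch below shows single-timestep condition-feature injection inside a DDIM-style denoising loop. It is a minimal illustration, not the released implementation: the toy UNet, the choice of injected layer (`mid`), and the timestep `inject_t=600` are assumptions made only for this example.

```python
import torch
import torch.nn as nn

T = 1000                                             # number of diffusion steps
betas = torch.linspace(1e-4, 2e-2, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

# Toy stand-in for a diffusion UNet; the real model and the choice of which
# layer's features to inject are assumptions, the schedule is the point.
class ToyUNet(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.t_emb = nn.Embedding(T, dim)
        self.mid = nn.Linear(dim, dim)               # feature-injection layer
        self.out = nn.Linear(dim, dim)

    def forward(self, x, t):
        h = torch.relu(self.mid(x + self.t_emb(t)))
        return self.out(h)                           # predicted noise

def add_noise(x0, t):
    a = alphas_cumprod[t]
    return a.sqrt() * x0 + (1 - a).sqrt() * torch.randn_like(x0)

@torch.no_grad()
def generate(unet, cond_latent, inject_t=600):
    # (i) Sample condition features ONCE, at a single fixed timestep
    #     `inject_t`, decoupled from the denoising schedule.
    cache = {}
    hook = unet.mid.register_forward_hook(lambda m, i, o: cache.update(f=o))
    unet(add_noise(cond_latent, inject_t), torch.tensor([inject_t]))
    hook.remove()

    # (ii) Denoise the output latent, replacing the same layer's output with
    #      the cached condition features at every step (structure guidance).
    hook = unet.mid.register_forward_hook(lambda m, i, o: cache["f"])
    x = torch.randn_like(cond_latent)
    for t in reversed(range(T)):
        eps = unet(x, torch.tensor([t]))
        a_t = alphas_cumprod[t]
        a_prev = alphas_cumprod[t - 1] if t > 0 else torch.tensor(1.0)
        x0 = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()
        x = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps  # DDIM update
    hook.remove()
    return x

# Example: a 1x64 "latent" standing in for the condition image's encoding.
unet = ToyUNet()
out = generate(unet, cond_latent=torch.randn(1, 64))
```

The key point is that the condition features are computed once at a fixed timestep and reused at every denoising step, rather than being re-sampled at the sampler's current timestep.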
Given a condition image and a text prompt, our method generates an output image that semantically aligns with the prompt while preserving the structure of the condition image. Our framework consists of three key components. (i) The Structure-Rich Injection (SRI) module selects the structure-rich features of the condition image and injects them into the output image feature space to enable spatial control. (ii) The Restart Refinement (RR) module refines the overall quality of the generated image by iteratively adding noise to and denoising the output. This process enhances the model's predicted clean images and leads to more realistic visual details, such as the eyes of the bear in the figure. (iii) The Appearance-Rich Prompting (ARP) module enriches the original plain prompt based on the semantics of the condition image, producing an appearance-rich prompt and thereby providing a more detailed and semantically aligned appearance image.
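The restart refinement step can be sketched in a few lines; the code below reuses the toy helpers (`ToyUNet`, `add_noise`, `alphas_cumprod`) from the sketch above, and `n_restarts` and `t_restart` are illustrative values rather than the paper's settings. Appearance-rich prompting operates purely on the text prompt and is therefore not shown.

```python
@torch.no_grad()
def restart_refine(unet, x, n_restarts=2, t_restart=300):
    # Iteratively re-noise the current output to an intermediate timestep and
    # denoise it back, which sharpens the model's predicted clean image.
    for _ in range(n_restarts):
        x = add_noise(x, t_restart)                       # add noise back
        for t in reversed(range(t_restart + 1)):          # denoise again
            eps = unet(x, torch.tensor([t]))
            a_t = alphas_cumprod[t]
            a_prev = alphas_cumprod[t - 1] if t > 0 else torch.tensor(1.0)
            x0 = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()
            x = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps
    return x

# Example: refine the output of the injection sketch above.
refined = restart_refine(unet, out)
```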
Qualitative results for more control conditions. Our method effectively and robustly handles a variety of challenging structural images, including scenarios that are infeasible for training-based approaches.
Qualitative comparison with existing methods. Our method demonstrates structure control on par with training-based approaches while delivering superior appearance fidelity. It effectively addresses the common failure modes observed in training-free baselines (structure misalignment, condition leakage, and visual artifacts), generating high-quality content that adheres closely to the prompt with strong spatial alignment.
@misc{zhang2025richcontrolstructureappearancerichtrainingfree,
  title={RichControl: Structure- and Appearance-Rich Training-Free Spatial Control for Text-to-Image Generation},
  author={Liheng Zhang and Lexi Pang and Hang Ye and Xiaoxuan Ma and Yizhou Wang},
  year={2025},
  eprint={2507.02792},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2507.02792},
}