RichControl: Structure- and Appearance-Rich Training-Free Spatial Control for Text-to-Image Generation

Peking University

*Equal Contribution

TL;DR: We propose a training-free framework that enables high-quality text-to-image spatial control under arbitrary conditions. By introducing rich structure and appearance control, our method effectively addresses key limitations of prior work, including structure misalignment, condition leakage, and artifacts.

Abstract

Text-to-image (T2I) diffusion models have shown remarkable success in generating high-quality images from text prompts. Recent efforts extend these models to incorporate conditional images (e.g., depth or pose maps) for fine-grained spatial control. Among them, feature injection methods have emerged as a training-free alternative to traditional fine-tuning approaches. However, they often suffer from structural misalignment, condition leakage, and visual artifacts, especially when the condition image diverges significantly from natural RGB distributions. We investigate and identify a core limitation in existing methods: the synchronous injection of condition features fails to account for the trade-off between domain alignment and structural preservation during denoising. To address this, we propose a flexible feature injection framework that decouples the injection timestep from the denoising process. This design enables better adaptation to the evolving trade-off between domain alignment and structural preservation across diffusion steps, leading to more structure-rich generation. In addition, we introduce appearance-rich prompting and a restart refinement strategy to enhance appearance control and visual quality. Together, these designs enable training-free generation that is both structure-rich and appearance-rich. Extensive experiments show that our approach achieves state-of-the-art performance across diverse zero-shot conditioning scenarios.
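To make the core idea concrete, here is a minimal sketch of decoupled feature injection, assuming hypothetical helpers (`denoise_step`, `extract_cond_features`, `inject_schedule`) that are not part of the paper: instead of always injecting condition features computed at the current denoising timestep, the features are extracted at a separately scheduled injection timestep.

```python
def sample_with_decoupled_injection(x_T, timesteps, denoise_step,
                                    extract_cond_features, inject_schedule):
    """Hypothetical sketch of decoupled feature injection.

    Synchronous methods would use t itself as the injection timestep;
    here inject_schedule(t) may return a different timestep, trading off
    domain alignment against structural preservation across the
    denoising trajectory.
    """
    x = x_T
    for t in timesteps:
        t_inj = inject_schedule(t)               # decoupled injection timestep
        feats = extract_cond_features(t_inj)     # condition features at t_inj
        x = denoise_step(x, t, feats)            # denoise with injected features
    return x
```

A synchronous baseline corresponds to `inject_schedule = lambda t: t`; the decoupling simply opens up this schedule as a free design choice.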

Video

Pipeline


Given a condition image and a text prompt, our method generates an output image that semantically aligns with the prompt while preserving the structure of the condition image. Our framework consists of three key components. (i) The Structure-Rich Injection (SRI) module selects structure-rich features from the condition image and injects them into the output image's feature space to enable spatial control. (ii) The Appearance-Rich Prompting (ARP) module enriches the original plain prompt with the semantics of the condition image, producing an appearance-rich prompt and thereby a more detailed, semantically aligned appearance image. (iii) The Restart Refinement (RR) module improves the overall quality of the generated image by iteratively re-noising and denoising the output. This process refines the model's predicted clean images and yields more realistic visual details, such as the bear's eyes in the figure.
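The restart refinement loop can be sketched as follows. This is a hedged illustration, not the paper's implementation: `add_noise` and `denoise` stand in for the forward diffusion step and the model's denoising pass, and `t_restart` and `num_rounds` are assumed hyperparameters.

```python
def restart_refinement(x0, add_noise, denoise, t_restart, num_rounds=2):
    """Hypothetical sketch of Restart Refinement (RR).

    Starting from a predicted clean image x0, repeatedly re-noise it to an
    intermediate timestep t_restart and denoise back, giving the model
    further passes to sharpen fine visual details.
    """
    x = x0
    for _ in range(num_rounds):
        x_t = add_noise(x, t_restart)   # forward diffusion to t_restart
        x = denoise(x_t, t_restart)     # denoise back to a clean estimate
    return x
```

Each round keeps the global structure (only moderate noise is added) while letting the denoiser regenerate local texture.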

BibTeX

@misc{zhang2025richcontrolstructureappearancerichtrainingfree,
      title={RichControl: Structure- and Appearance-Rich Training-Free Spatial Control for Text-to-Image Generation}, 
      author={Liheng Zhang and Lexi Pang and Hang Ye and Xiaoxuan Ma and Yizhou Wang},
      year={2025},
      eprint={2507.02792},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2507.02792}, 
}