Text-to-image (T2I) diffusion models have shown remarkable success in generating high-quality images from text prompts. Recent efforts extend these models to incorporate conditional images (e.g., depth or pose maps) for fine-grained spatial control. Among them, feature injection methods have emerged as a training-free alternative to traditional fine-tuning approaches. However, they often suffer from structural misalignment, condition leakage, and visual artifacts, especially when the condition image diverges significantly from natural RGB distributions. We investigate and identify a core limitation in existing methods: the synchronous injection of condition features fails to account for the trade-off between domain alignment and structural preservation during denoising. To address this, we propose a flexible feature injection framework that decouples the injection timestep from the denoising process. This design enables better adaptation to the evolving trade-off between domain alignment and structural preservation across diffusion steps, leading to more structure-rich generation. In addition, we introduce appearance-rich prompting and a restart refinement strategy to enhance appearance control and visual quality. Together, these designs enable training-free generation that is both structure-rich and appearance-rich. Extensive experiments show that our approach achieves state-of-the-art performance across diverse zero-shot conditioning scenarios.
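To make the decoupling concrete, the sketch below contrasts synchronous injection, where the condition branch runs at the same timestep as the denoising step, with a decoupled schedule that pins the condition branch to a separately chosen timestep. This is a minimal PyTorch sketch of the idea, not the released implementation: extract_features, inject_features, and the fixed timestep value are illustrative placeholders.

import torch

def injection_timestep(t: torch.Tensor, decoupled: bool = True,
                       t_fixed: int = 801) -> torch.Tensor:
    # Synchronous injection reuses the denoising timestep t; a decoupled
    # schedule can instead hold the condition branch at a fixed timestep
    # (t_fixed is an illustrative value here, not a tuned constant).
    if not decoupled:
        return t
    return torch.full_like(t, t_fixed)

@torch.no_grad()
def denoise_step(unet, x_t, t, cond_latent, prompt_emb):
    # Run the condition image through the UNet at the (possibly decoupled)
    # injection timestep and cache its intermediate features, e.g. via
    # forward hooks. extract_features and inject_features are hypothetical
    # helpers standing in for that hook machinery.
    t_inj = injection_timestep(t)
    cond_feats = extract_features(unet, cond_latent, t_inj, prompt_emb)
    with inject_features(unet, cond_feats):
        return unet(x_t, t, encoder_hidden_states=prompt_emb).sample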
Given a condition image and a text prompt, our method generates an output image that semantically aligns with the prompt while preserving the structure of the condition image. Our framework consists of three key components. (i) The Structure-Rich Injection (SRI) module selects structure-rich features of the condition image and injects them into the output image feature space to enable spatial control. (ii) The Appearance-Rich Prompting (ARP) module enriches the original plain prompt with the semantics of the condition image, producing an appearance-rich prompt that guides the generation toward a more detailed and semantically aligned appearance. (iii) The Restart Refinement (RR) module improves the overall quality of the generated image by iteratively re-noising and denoising the output. This process sharpens the model's predicted clean images and yields more realistic visual details, such as the eyes of the bear in the figure.
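The restart loop in (iii) can be sketched in a few lines. This is a minimal sketch under assumed details: DDPM-style forward noising with cumulative alphas, and a hypothetical denoise_from helper that runs the sampler from an intermediate timestep back to 0. The restart timestep and count are illustrative, not the paper's settings.

import torch

@torch.no_grad()
def restart_refine(x0, alphas_cumprod, denoise_from,
                   t_restart=300, n_restarts=2):
    # Iteratively re-noise the current clean estimate to an intermediate
    # timestep and denoise it again; each pass refines fine details while
    # the overall structure is preserved.
    for _ in range(n_restarts):
        a_bar = alphas_cumprod[t_restart]
        eps = torch.randn_like(x0)
        # Forward noising: x_t = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * eps
        x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps
        x0 = denoise_from(x_t, t_restart)  # hypothetical sampler call
    return x0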
Qualitative results for more control conditions. Our method effectively and robustly handles a variety of challenging structural images, including scenarios that are infeasible for training-based approaches.
Qualitative comparison with existing methods. Our method demonstrates structure control on par with training-based approaches while delivering superior appearance fidelity. It effectively addresses the common failure modes of training-free baselines, namely structural misalignment, condition leakage, and visual artifacts, and generates high-quality content that adheres closely to the prompt with strong spatial alignment.
@misc{zhang2025richcontrolstructureappearancerichtrainingfree,
  title={RichControl: Structure- and Appearance-Rich Training-Free Spatial Control for Text-to-Image Generation},
  author={Liheng Zhang and Lexi Pang and Hang Ye and Xiaoxuan Ma and Yizhou Wang},
  year={2025},
  eprint={2507.02792},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2507.02792},
}