Detail++: Progressive Detail Injection for Training-Free Semantic Binding in Text-to-Image Generation

Lifeng Chen1, Jiner Wang1, Zihao Pan1, Beier Zhu1, 2, Xiaofeng Yang1, 2, Chi Zhang†1
† Indicates Corresponding Author.
1AGI Lab, Westlake University,  2Nanyang Technological University.
Teaser Image

A comparison between our method and current state-of-the-art generative models. Mainstream models often suffer from issues such as semantic overflow, complex attribute mismatching, and style blending. Even Flux, the leading generative model under the DiT framework, struggles to overcome these challenges. In contrast, our method, Detail++, built on SDXL, achieves highly accurate semantic binding in a training-free way.

Abstract

Recent advances in text-to-image (T2I) generation have led to impressive visual results. However, these models still face significant challenges when handling complex prompts—particularly those involving multiple subjects with distinct attributes. Inspired by the human drawing process, which first outlines the composition and then incrementally adds details, we propose Detail++, a training-free framework that introduces a novel Progressive Detail Injection (PDI) strategy to address this limitation. Specifically, we decompose a complex prompt into a sequence of simplified sub-prompts, guiding the generation process in stages. This staged generation leverages the inherent layout-controlling capacity of self-attention to first ensure global composition, followed by precise refinement. To achieve accurate binding between attributes and corresponding subjects, we exploit cross-attention mechanisms and further introduce a Centroid Alignment Loss at test time to reduce binding noise and enhance attribute consistency. Extensive experiments on T2I-CompBench and a newly constructed style composition benchmark demonstrate that Detail++ significantly outperforms existing methods, particularly in scenarios involving multiple objects and complex stylistic conditions.
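As a rough illustration of the test-time guidance mentioned above, the sketch below computes the spatial centroid of a token's cross-attention map and penalizes the distance between an attribute token's centroid and the centroid of its target subject token. This is only a plausible reading of the Centroid Alignment Loss; the function names and the exact formulation are our assumptions, not the paper's definition.

# Hedged sketch of a centroid-style alignment loss on cross-attention maps.
# The exact formulation in Detail++ may differ; this only illustrates pulling
# an attribute token's attention centroid toward its bound subject's centroid.
import torch

def attention_centroid(attn_map: torch.Tensor) -> torch.Tensor:
    """attn_map: (H, W) non-negative cross-attention map for one token."""
    h, w = attn_map.shape
    ys = torch.arange(h, dtype=attn_map.dtype, device=attn_map.device)
    xs = torch.arange(w, dtype=attn_map.dtype, device=attn_map.device)
    p = attn_map / (attn_map.sum() + 1e-8)   # normalize to a spatial distribution
    cy = (p.sum(dim=1) * ys).sum()           # expected row index
    cx = (p.sum(dim=0) * xs).sum()           # expected column index
    return torch.stack([cy, cx])

def centroid_alignment_loss(attr_map: torch.Tensor,
                            subj_map: torch.Tensor) -> torch.Tensor:
    """Squared distance between attribute and subject attention centroids."""
    return (attention_centroid(attr_map) - attention_centroid(subj_map)).pow(2).sum()

# Toy usage with random maps standing in for real cross-attention maps.
attr_map = torch.rand(32, 32)
subj_map = torch.rand(32, 32)
print(centroid_alignment_loss(attr_map, subj_map))
# In practice the maps come from the U-Net's cross-attention layers, so the
# gradient of this loss with respect to the latent can steer a test-time update.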




Overview

Paradigm Image

The basic process of Detail++. As shown in the first column, generating from a complex prompt in a single branch often results in inaccurate or blended attribute assignments: attributes such as "sunglasses" and "necklace" may be mistakenly applied to the wrong subject. Our method addresses this challenge through a progressive approach: we first ignore all complex modifiers to produce a rough generation base, then systematically inject details so that each attribute is added precisely to its corresponding subject region. The prompt displayed below each image indicates the input used for that specific branch, and the second row demonstrates the method's effectiveness in style-combination scenarios. Note that all four branches are generated in parallel.
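Below is a minimal sketch of how such cumulative sub-prompts could be produced with spaCy. The helper decompose_prompt and its heuristic (drop all adjectival modifiers for the base prompt, then re-introduce them one per branch) are illustrative assumptions rather than the exact parsing rules used by Detail++.

# Illustrative sketch of cumulative sub-prompt construction with spaCy.
import spacy

# Requires the small English model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def decompose_prompt(prompt: str) -> list[str]:
    """Split a complex prompt into cumulative sub-prompts (heuristic, assumed)."""
    doc = nlp(prompt)
    # Indices of adjectival modifiers attached to nouns, in reading order.
    modifier_ids = [tok.i for tok in doc if tok.dep_ == "amod"]

    def render(allowed: set) -> str:
        # Keep every token except modifiers that have not yet been re-introduced.
        return "".join(tok.text_with_ws for tok in doc
                       if tok.i not in set(modifier_ids) - allowed).strip()

    sub_prompts = [render(set())]            # base prompt: rough layout only
    for k in range(1, len(modifier_ids) + 1):
        # Branch k re-introduces the first k modifiers, accumulating detail.
        sub_prompts.append(render(set(modifier_ids[:k])))
    return sub_prompts

print(decompose_prompt("a red bird and a blue cat in watercolor style"))
# e.g. ['a bird and a cat in watercolor style',
#       'a red bird and a cat in watercolor style',
#       'a red bird and a blue cat in watercolor style']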


Method

Paradigm Image

Overview of Detail++. Our method consists of the overall Progressive Detail Injection (PDI) framework together with test-time attention guidance based on a Centroid Alignment Loss. The sub-prompts, decomposed by spaCy, are passed through the U-Net together for parallel inference. At each denoising step, the resulting batch of latents undergoes Accumulative Latent Modification (ALM), which modifies only the region corresponding to the attribute currently being added. Note that, in our framework, the self-attention maps of all branches are unified, which ensures a consistent layout and avoids conflicts between editing effects.
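The sketch below illustrates the ALM merging rule on toy tensors: each branch contributes only the region selected by a binary mask (e.g., a thresholded cross-attention map of the newly added attribute), and everything else is inherited from the previously merged result. Tensor shapes, mask construction, and the function name are illustrative assumptions; in Detail++ this happens inside the SDXL denoising loop with self-attention shared across branches.

# Sketch of Accumulative Latent Modification (ALM) at one denoising step.
# latents holds the parallel branches (base prompt first, then one extra
# attribute per branch); masks are binary maps, e.g. thresholded
# cross-attention maps of each newly added attribute. Shapes are assumed.
import torch

def accumulative_latent_modification(
    latents: torch.Tensor,   # (num_branches, C, H, W) latents after the U-Net step
    masks: torch.Tensor,     # (num_branches, 1, H, W) in {0, 1}; masks[0] is unused
) -> torch.Tensor:
    merged = latents[0]      # branch 0: rough layout, no extra attributes
    outputs = [merged]
    for k in range(1, latents.shape[0]):
        # Take only the k-th attribute's region from branch k;
        # carry everything else over from the already-merged result.
        merged = masks[k] * latents[k] + (1.0 - masks[k]) * merged
        outputs.append(merged)
    return torch.stack(outputs)   # last entry holds all injected details

# Toy usage with random tensors standing in for real latents and masks.
lat = torch.randn(4, 4, 64, 64)
msk = (torch.rand(4, 1, 64, 64) > 0.5).float()
print(accumulative_latent_modification(lat, msk).shape)  # torch.Size([4, 4, 64, 64])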



Qualitative Comparison

Paradigm Image

Qualitative comparison on complex prompts covering object, color, texture, and style attributes. For prompts with multiple attributes, our method effectively avoids semantic mismatching and overflow. Notably, only our method handles the style-blending problem well.



BibTeX

@misc{chen2025detailtrainingfreeenhancertexttoimage,
      title={Detail++: Training-Free Detail Enhancer for Text-to-Image Diffusion Models},
      author={Lifeng Chen and Jiner Wang and Zihao Pan and Beier Zhu and Xiaofeng Yang and Chi Zhang},
      year={2025},
      eprint={2507.17853},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2507.17853},
}