Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
FreeControl: Efficient, Training-Free Structural Control via One-Step Attention Extraction
Authors: Jiang Lin, Xinyu Chen, Song Wu, Zhiqiu Zhang, Jizhi Zhang, Ye Wang, Qiang Tang, Qian Wang, Jian Yang, Zili Yi
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct all experiments using the FLUX.1-dev [16] model with the Flow Match Euler Discrete scheduler, a timestep range of 1000 to 400, and a guidance scale of 6.5. Quantitative results use 25 denoising steps; 50 steps are used elsewhere for improved visual quality. 4.2 Quantitative Comparison. Dataset. We evaluate on 5,000 images sampled from the COCO 2017 [20] validation set, resized to 512 × 512. Each image is paired with its corresponding caption, which is used as the input text prompt for controlled generation. Metrics. We report FID for visual fidelity, SSIM and PSNR for low-level similarity, and CLIP-Text Similarity [31] for semantic alignment between images and prompts. Comparison Methods. We compare FreeControl against five strong baselines: ControlNet [47], Uni-ControlNet [49], UniControl [30], ControlNet++ [17], and Flux-ControlNet [43, 44]. Results. Table 1 reports quantitative comparisons across several metrics. FreeControl outperforms all baselines in terms of structural similarity (SSIM and PSNR), while maintaining competitive CLIP-Text alignment with prompt semantics. 5 Ablation Study. 5.1 One-Step vs. Iterative Attention Extraction. To validate the effectiveness of extracting attention from a single timestep, we compare our method against a baseline that mimics iterative attention extraction across multiple denoising steps, similar to inversion-based or reconstruction-based strategies. In this baseline, attention matrices are extracted and injected step-by-step, rather than reused. As shown in Table 1 and Table 2, one-step injection achieves comparable structural fidelity while significantly reducing computational overhead. This result supports our hypothesis that structural information can be captured once and reused without loss of guidance, due to the shared purpose of structural encoding across timesteps. |
| Researcher Affiliation | Collaboration | Jiang Lin¹, Xinyu Chen¹, Song Wu², Zhiqiu Zhang¹, Jizhi Zhang¹, Ye Wang⁴, Qiang Tang³, Qian Wang², Jian Yang¹, Zili Yi¹. ¹Nanjing University, Suzhou, China; ²JIUTIAN Research, Beijing, China; ³University of British Columbia, Vancouver, Canada; ⁴Jilin University, Changchun, China |
| Pseudocode | No | The paper describes the method and steps using mathematical equations (1) and (2) and explanatory text, but does not include a dedicated, structured pseudocode block or algorithm section. |
| Open Source Code | No | The paper has clearly provided the details needed to reproduce the method proposed to the extent to support our claim; however, we are yet unable to provide a properly formulated code regarding the method and all the experiments conducted. This paper will, however, open-source the code regarding its main method upon acceptance. |
| Open Datasets | Yes | Dataset. We evaluate on 5,000 images sampled from the COCO 2017 [20] validation set, resized to 512 × 512. Each image is paired with its corresponding caption, which is used as the input text prompt for controlled generation. |
| Dataset Splits | Yes | Dataset. We evaluate on 5,000 images sampled from the COCO 2017 [20] validation set, resized to 512 × 512. Each image is paired with its corresponding caption, which is used as the input text prompt for controlled generation. A. Quantitative Results on Stylized Prompts (COCO Dataset). We construct this benchmark on the COCO [20] validation set. For each image, we retain the original as the structural reference and generate stylized prompts by combining the original caption with one of five target styles (e.g., Cyberpunk, Vaporwave). |
| Hardware Specification | Yes | Inference is performed on a single NVIDIA RTX A6000 GPU with 48 GB of memory, and the inference time is measured over 100 runs. All models are run with 25 denoising steps and produce 1024 × 1024 px outputs on an NVIDIA RTX A6000. |
| Software Dependencies | No | The paper mentions models and frameworks like "FLUX.1-dev [16]" and "Stable Diffusion v1.5", and uses "GPT-4o [25]" for prompt generation. However, it does not specify versions for general programming languages or libraries like Python, PyTorch, or CUDA, which are typically considered ancillary software dependencies for reproducibility. |
| Experiment Setup | Yes | We conduct all experiments using the FLUX.1-dev [16] model with the Flow Match Euler Discrete scheduler, a timestep range of 1000 to 400, and a guidance scale of 6.5. Quantitative results use 25 denoising steps; 50 steps are used elsewhere for improved visual quality. The key timestep t is fixed at 661. Attention is extracted once and injected into the last 25 transformer layers of the model's single transformer block in the quantitative evaluations, and may be reduced elsewhere to demonstrate results of lower structural control. Compositional image generation is disabled unless specifically ablated. All SD 1.5-based models are run with 20 denoising steps, and Flux-based methods including FreeControl use 25 steps, following the respective official configurations. |
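
The settings quoted in the Experiment Setup row can be collected into a single configuration record, which may help anyone attempting a reimplementation keep track of them. The following is a minimal sketch, not the authors' code (which had not been released at the time of the paper): the class name `FreeControlSetup`, every field name, and the helper `denoising_steps` are hypothetical and chosen only for illustration. The values themselves are the ones reported in the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FreeControlSetup:
    """Hyperparameters quoted in the Experiment Setup row.

    All field names are hypothetical; only the values come from the paper.
    """
    base_model: str = "FLUX.1-dev"
    scheduler: str = "Flow Match Euler Discrete"
    timestep_start: int = 1000          # start of the reported timestep range
    timestep_end: int = 400             # end of the reported timestep range
    guidance_scale: float = 6.5
    quantitative_steps: int = 25        # denoising steps for quantitative results
    qualitative_steps: int = 50         # denoising steps used elsewhere
    key_timestep: int = 661             # timestep at which attention is extracted once
    injected_layers: int = 25           # last N transformer layers receiving attention
    compositional_generation: bool = False  # disabled unless ablated

    def validate(self) -> None:
        """Check basic consistency of the recorded values."""
        if self.timestep_end > self.timestep_start:
            raise ValueError("timestep range must run from high to low")
        if not self.timestep_end <= self.key_timestep <= self.timestep_start:
            raise ValueError("key timestep must lie inside the timestep range")
        if min(self.quantitative_steps, self.qualitative_steps) <= 0:
            raise ValueError("step counts must be positive")
        if self.injected_layers <= 0:
            raise ValueError("at least one layer must receive injected attention")


def denoising_steps(setup: FreeControlSetup, quantitative: bool) -> int:
    """Return the step count for a quantitative or a qualitative run."""
    return setup.quantitative_steps if quantitative else setup.qualitative_steps


setup = FreeControlSetup()
setup.validate()
print(denoising_steps(setup, quantitative=True))
```

Freezing the dataclass keeps the recorded values from being changed accidentally between runs; an ablation that reduces the number of injected layers would create a new instance with a different `injected_layers` value.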