Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
DEFT: Decompositional Efficient Fine-Tuning for Text-to-Image Models
Authors: Komal Kumar, Rao Anwer, Fahad Shahbaz Khan, Salman H Khan, Ivan Laptev, Hisham Cholakkal
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conducted extensive experiments on the Dreambooth and Dreambench Plus datasets for personalization, the InsDet dataset for object and scene adaptation, and the VisualCloze dataset for a universal image generation framework through visual in-context learning with both Stable Diffusion and a unified model. Our results demonstrated state-of-the-art performance, highlighting the emergent properties of efficient fine-tuning. |
| Researcher Affiliation | Academia | Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE EMAIL |
| Pseudocode | No | The paper describes the methodology using mathematical equations and textual explanations, such as Section 3 'Methodology' and Section 3.1 'Decompositional Efficient Fine-Tuning', but does not include any distinct pseudocode blocks or algorithms formatted with numbered steps. |
| Open Source Code | Yes | Our code is available on DEFT. Everything will be available on GitHub and Huggingface online for reproducibility. |
| Open Datasets | Yes | We conducted extensive experiments on the Dreambooth and Dreambench Plus datasets for personalization, the InsDet dataset for object and scene adaptation, and the VisualCloze dataset for a universal image generation framework through visual in-context learning with both Stable Diffusion and a unified model. For evaluation, we use several datasets, including the DreamBooth Benchmark [34], which contains 30 personalized concepts across 15 categories, with 4-6 images per concept and 25 challenging prompts. Additionally, we curate datasets for single-subject and multi-subject personalization, ensuring diverse compositions. Key datasets include VisualCloze [25] (We created 3M instructions for in-context learning for fine-tuning), DreamBooth (30 concepts for subject-driven generation), DreamBench Plus [31] (150 concepts for human-aligned benchmarks), and InsDet [37], the high-resolution dataset for instance detection with 100 objects and 5 scenes. |
| Dataset Splits | No | The paper mentions evaluating on the "Visualcloze [25] test dataset" and that the "DreamBooth Benchmark [34]" contains "30 personalized concepts across 15 categories, with 4-6 images per concept and 25 challenging prompts." While it refers to a test set and characteristics of benchmark datasets, it does not explicitly provide the specific training/validation/test splits used for DEFT's fine-tuning experiments, such as percentages, absolute counts, or detailed splitting methodology. |
| Hardware Specification | Yes | All experiments were conducted with 4 NVIDIA RTX A6000 50GB GPUs. |
| Software Dependencies | No | The paper mentions: "We wrote the library of the DEFT using Torch... We extend stable diffusion and stable diffusion XL (SDXL) image personalization, along with Omnigen [48], into a unified model... For training and evaluation, we use the diffuser library [42]." However, it does not provide specific version numbers for Torch, Stable Diffusion, SDXL, Omnigen, or the diffuser library. |
| Experiment Setup | Yes | We use rank (r) equal to 4 for DreamBench Plus [31] benchmarking for all of the methods. For VisualCloze [25] universal image generation, we used r = 32. To further stabilize optimization, the framework applies a higher learning rate to R compared to P, mirroring the design of LoRA. Furthermore, we analyzed training efficiency for the rank-64 configuration with a batch size of 2 on the OmniGen 3.762B parameter model. |