Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing
Authors: Dongxu Li, Junnan Li, Steven Hoi
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | BLIP-Diffusion achieves promising zero-shot subject-driven generation results and superior fine-tuning efficiency. For example, BLIP-Diffusion takes 40-120 fine-tuning steps to specialize for a given subject, achieving up to 20x speedup compared to DreamBooth [9]. We conduct ablation studies using 250K subject representation learning steps. Table 2 shows zero-shot evaluation results. |
| Researcher Affiliation | Industry | Dongxu Li, Junnan Li, Steven C.H. Hoi; Salesforce AI Research; Corresponding authors: EMAIL |
| Pseudocode | No | The paper describes algorithms and processes in text and figures, but does not include any formal pseudocode or algorithm blocks. |
| Open Source Code | Yes | https://github.com/salesforce/LAVIS/tree/main/projects/blip-diffusion |
| Open Datasets | Yes | For multimodal representation learning, we follow BLIP-2 [12] and pretrain the model on 129M image-text pairs, including 115M image-text pairs from LAION [28] with CapFilt [29] captions, COCO [30], Visual Genome [31] and Conceptual Captions [32, 33]. For subject representation learning, we use a subset of 292K images from OpenImage-V6 [22]. |
| Dataset Splits | No | The paper mentions selecting checkpoints based on 'validation prompts' during fine-tuning, but it does not specify explicit dataset splits (e.g., percentages, sample counts) for training, validation, or testing for either the large-scale pre-training or the subject-specific fine-tuning. |
| Hardware Specification | Yes | We fine-tune models on a single A100 (40Gb) GPU and select checkpoints manually based on a set of validation prompts. We use a total batch size 16 with a constant learning rate 2e-6 for 500K steps using AdamW [26] optimizer, taking 6 days to finish on 16 A100 40Gb GPUs. |
| Software Dependencies | No | The paper mentions using AdamW [26] optimizer, but does not provide specific version numbers for any software libraries, programming languages, or other dependencies. |
| Experiment Setup | Yes | We use a total batch size 16 with a constant learning rate 2e-6 for 500K steps using AdamW [26] optimizer, taking 6 days to finish on 16 A100 40Gb GPUs. For all fine-tuning experiments, we use AdamW [26] optimizer with constant learning rate 5e-6 and no warm-up steps. We use batch size 3, adam beta1 0.9, adam beta2 0.999, adam epsilon 1e-8 and weight decay 0.01. For inference, we use PNDM scheduler [39] for 100 denoising steps. We use a fixed guidance scale 7.5 for all experiments. |
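
The hyperparameters quoted in the Experiment Setup row can be gathered into one place for anyone attempting a reproduction. The following is a minimal sketch in plain Python, not the authors' configuration format: the class names, field names, and the `per_gpu_batch_size` helper are hypothetical. Reading the 500K-step run as the pre-training stage, and assuming its batch is split evenly across the 16 GPUs, are also assumptions, since the quoted text states neither. `adamw_kwargs` only arranges the quoted values under the keyword names that `torch.optim.AdamW` accepts; it does not import PyTorch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PretrainConfig:
    """The 500K-step run quoted in the table (assumed to be pre-training)."""
    total_batch_size: int = 16
    learning_rate: float = 2e-6   # constant learning rate
    steps: int = 500_000
    num_gpus: int = 16            # A100 40Gb GPUs
    wall_clock_days: int = 6


@dataclass(frozen=True)
class FinetuneConfig:
    """Subject-specific fine-tuning settings quoted in the table."""
    batch_size: int = 3
    learning_rate: float = 5e-6   # constant learning rate
    warmup_steps: int = 0         # "no warm-up steps"
    adam_beta1: float = 0.9
    adam_beta2: float = 0.999
    adam_epsilon: float = 1e-8
    weight_decay: float = 0.01

    def adamw_kwargs(self) -> dict:
        """Arrange the values under the keyword names torch.optim.AdamW uses."""
        return {
            "lr": self.learning_rate,
            "betas": (self.adam_beta1, self.adam_beta2),
            "eps": self.adam_epsilon,
            "weight_decay": self.weight_decay,
        }


@dataclass(frozen=True)
class InferenceConfig:
    """Sampling settings quoted in the table."""
    scheduler: str = "PNDM"
    denoising_steps: int = 100
    guidance_scale: float = 7.5


def per_gpu_batch_size(cfg: PretrainConfig) -> int:
    """Per-device batch size, ASSUMING an even data-parallel split."""
    if cfg.total_batch_size % cfg.num_gpus != 0:
        raise ValueError("total batch size is not divisible by the GPU count")
    return cfg.total_batch_size // cfg.num_gpus


if __name__ == "__main__":
    print(per_gpu_batch_size(PretrainConfig()))
    print(FinetuneConfig().adamw_kwargs())
    print(InferenceConfig())
```

With PyTorch installed, the dictionary could be unpacked as `torch.optim.AdamW(model.parameters(), **FinetuneConfig().adamw_kwargs())`, where `model` is whatever module is being fine-tuned. Because the Software Dependencies row reports no version numbers, any library versions used for this remain the reader's choice.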