Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing

Authors: Dongxu Li, Junnan Li, Steven Hoi

NeurIPS 2023 | Venue PDF | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | BLIP-Diffusion achieves promising zero-shot subject-driven generation results and superior fine-tuning efficiency. For example, BLIP-Diffusion takes 40-120 fine-tuning steps to specialize for a given subject, achieving up to 20x speedup compared to DreamBooth [9]. We conduct ablation studies using 250K subject representation learning steps. Table 2 shows zero-shot evaluation results. |
| Researcher Affiliation | Industry | Dongxu Li, Junnan Li, Steven C.H. Hoi; Salesforce AI Research; Corresponding authors: EMAIL |
| Pseudocode | No | The paper describes algorithms and processes in text and figures, but does not include any formal pseudocode or algorithm blocks. |
| Open Source Code | Yes | https://github.com/salesforce/LAVIS/tree/main/projects/blip-diffusion |
| Open Datasets | Yes | For multimodal representation learning, we follow BLIP-2 [12] and pretrain the model on 129M image-text pairs, including 115M image-text pairs from LAION [28] with CapFilt [29] captions, COCO [30], Visual Genome [31] and Conceptual Captions [32, 33]. For subject representation learning, we use a subset of 292K images from OpenImage-V6 [22]. |
| Dataset Splits | No | The paper mentions selecting checkpoints based on 'validation prompts' during fine-tuning, but it does not specify explicit dataset splits (e.g., percentages, sample counts) for training, validation, or testing for either the large-scale pre-training or the subject-specific fine-tuning. |
| Hardware Specification | Yes | We fine-tune models on a single A100 (40Gb) GPU and select checkpoints manually based on a set of validation prompts. We use a total batch size 16 with a constant learning rate 2e-6 for 500K steps using AdamW [26] optimizer, taking 6 days to finish on 16 A100 40Gb GPUs. |
| Software Dependencies | No | The paper mentions using AdamW [26] optimizer, but does not provide specific version numbers for any software libraries, programming languages, or other dependencies. |
| Experiment Setup | Yes | We use a total batch size 16 with a constant learning rate 2e-6 for 500K steps using AdamW [26] optimizer, taking 6 days to finish on 16 A100 40Gb GPUs. For all fine-tuning experiments, we use AdamW [26] optimizer with constant learning rate 5e-6 and no warm-up steps. We use batch size 3, adam beta1 0.9, adam beta2 0.999, adam epsilon 1e-8 and weight decay 0.01. For inference, we use PNDM scheduler [39] for 100 denoising steps. We use a fixed guidance scale 7.5 for all experiments. |
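
To make the quoted settings easier to compare, the sketch below collects the values from the Hardware Specification and Experiment Setup rows into plain Python dataclasses and derives a few rough totals. The class, field, and function names are hypothetical and are not taken from the paper or the LAVIS repository. The GPU-day figure assumes all 16 GPUs were in use for the full 6 days, which the quoted text does not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PretrainConfig:
    """Subject representation learning settings as quoted above (names are hypothetical)."""
    total_batch_size: int = 16
    learning_rate: float = 2e-6   # constant
    steps: int = 500_000
    num_gpus: int = 16            # A100 40Gb
    wall_clock_days: int = 6


@dataclass(frozen=True)
class FinetuneConfig:
    """Subject-specific fine-tuning settings as quoted above (names are hypothetical)."""
    learning_rate: float = 5e-6   # constant, no warm-up
    warmup_steps: int = 0
    batch_size: int = 3
    adam_beta1: float = 0.9
    adam_beta2: float = 0.999
    adam_epsilon: float = 1e-8
    weight_decay: float = 0.01
    min_steps: int = 40           # quoted range: 40-120 fine-tuning steps
    max_steps: int = 120


@dataclass(frozen=True)
class InferenceConfig:
    """Inference settings as quoted above (names are hypothetical)."""
    scheduler: str = "PNDM"
    denoising_steps: int = 100
    guidance_scale: float = 7.5


def pretrain_gpu_days(cfg: PretrainConfig) -> int:
    # Assumption: every GPU is busy for the whole wall-clock duration.
    return cfg.num_gpus * cfg.wall_clock_days


def pretrain_samples_seen(cfg: PretrainConfig) -> int:
    # Samples processed, counting repeats across passes over the data.
    return cfg.total_batch_size * cfg.steps


def finetune_samples_seen(cfg: FinetuneConfig) -> tuple[int, int]:
    # Lower and upper bound on samples processed during fine-tuning.
    return (cfg.batch_size * cfg.min_steps, cfg.batch_size * cfg.max_steps)


if __name__ == "__main__":
    pre, ft = PretrainConfig(), FinetuneConfig()
    print("pre-training GPU-days:", pretrain_gpu_days(pre))
    print("pre-training samples seen:", pretrain_samples_seen(pre))
    print("fine-tuning samples seen (min, max):", finetune_samples_seen(ft))
```

Under these assumptions the pre-training run comes to about 96 A100 GPU-days and 8 million processed samples, while a single fine-tuning run processes between 120 and 360 samples. These are back-of-the-envelope figures derived from the quoted numbers, not values reported by the paper.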