BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing
Authors: Dongxu Li, Junnan Li, Steven C.H. Hoi
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | BLIP-Diffusion achieves promising zero-shot subject-driven generation results and superior fine-tuning efficiency. For example, BLIP-Diffusion takes 40-120 fine-tuning steps to specialize for a given subject, achieving up to 20x speedup compared to DreamBooth [9]. We conduct ablation studies using 250K subject representation learning steps. Table 2 shows zero-shot evaluation results. |
| Researcher Affiliation | Industry | Dongxu Li, Junnan Li, Steven C.H. Hoi. Salesforce AI Research. Corresponding authors: {li.d,junnan.li,shoi}@salesforce.com |
| Pseudocode | No | The paper describes algorithms and processes in text and figures, but does not include any formal pseudocode or algorithm blocks. |
| Open Source Code | Yes | https://github.com/salesforce/LAVIS/tree/main/projects/blip-diffusion (a minimal loading sketch appears after the table) |
| Open Datasets | Yes | For multimodal representation learning, we follow BLIP-2 [12] and pretrain the model on 129M image-text pairs, including 115M image-text pairs from LAION [28] with CapFilt [29] captions, COCO [30], Visual Genome [31] and Conceptual Captions [32, 33]. For subject representation learning, we use a subset of 292K images from OpenImages-V6 [22]. |
| Dataset Splits | No | The paper mentions selecting checkpoints based on 'validation prompts' during fine-tuning, but it does not specify explicit dataset splits (e.g., percentages, sample counts) for training, validation, or testing for either the large-scale pre-training or the subject-specific fine-tuning. |
| Hardware Specification | Yes | We fine-tune models on a single A100 (40GB) GPU and select checkpoints manually based on a set of validation prompts. We use a total batch size 16 with a constant learning rate 2e-6 for 500K steps using AdamW [26] optimizer, taking 6 days to finish on 16 A100 40GB GPUs. |
| Software Dependencies | No | The paper mentions using the AdamW [26] optimizer, but does not provide specific version numbers for any software libraries, programming languages, or other dependencies. |
| Experiment Setup | Yes | We use a total batch size 16 with a constant learning rate 2e-6 for 500K steps using AdamW [26] optimizer, taking 6 days to finish on 16 A100 40GB GPUs. For all fine-tuning experiments, we use AdamW [26] optimizer with constant learning rate 5e-6 and no warm-up steps. We use batch size 3, adam beta1 0.9, adam beta2 0.999, adam epsilon 1e-8 and weight decay 0.01. For inference, we use PNDM scheduler [39] for 100 denoising steps. We use a fixed guidance scale 7.5 for all experiments. (See the configuration sketch after the table.) |
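As referenced in the Open Source Code row, a minimal loading sketch for the released model is given below. It assumes the LAVIS package from the linked repository is installed; `load_model_and_preprocess` is LAVIS's standard model-loading entry point, but the `name` and `model_type` strings used here are assumptions, so the exact identifiers should be checked against the project README at the URL above.

```python
# Minimal loading sketch, not code from the paper. Assumes the LAVIS package
# from the linked repository is installed (e.g., `pip install salesforce-lavis`).
import torch
from lavis.models import load_model_and_preprocess

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# The registry name and model type below are assumptions for illustration;
# the project README lists the exact identifiers.
model, vis_processors, txt_processors = load_model_and_preprocess(
    name="blip_diffusion",   # assumed registry name
    model_type="base",       # assumed model type
    is_eval=True,
    device=device,
)
```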
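The Experiment Setup row can likewise be read as a configuration sketch. The numeric values below (learning rate 5e-6, betas 0.9/0.999, epsilon 1e-8, weight decay 0.01, batch size 3, 100 PNDM denoising steps, guidance scale 7.5) come directly from the quoted text; the placeholder `unet` module and the default scheduler configuration are assumptions for illustration, not the authors' training code.

```python
# Sketch of the quoted fine-tuning and inference hyperparameters, not the
# authors' implementation. `unet` is a placeholder for the trainable
# diffusion backbone parameters.
import torch
from diffusers import PNDMScheduler

unet = torch.nn.Linear(4, 4)  # placeholder module standing in for the U-Net

# Subject-specific fine-tuning settings quoted in the table:
optimizer = torch.optim.AdamW(
    unet.parameters(),
    lr=5e-6,             # constant learning rate, no warm-up steps
    betas=(0.9, 0.999),  # adam beta1 / beta2
    eps=1e-8,            # adam epsilon
    weight_decay=0.01,
)
batch_size = 3

# Inference settings quoted in the table. The scheduler is built with default
# config here; in practice the beta schedule would come from the underlying
# Stable Diffusion checkpoint.
scheduler = PNDMScheduler()
scheduler.set_timesteps(100)   # 100 denoising steps
guidance_scale = 7.5           # fixed classifier-free guidance scale
```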