BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing
Authors: Dongxu Li, Junnan Li, Steven C.H. Hoi
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | BLIP-Diffusion achieves promising zero-shot subject-driven generation results and superior fine-tuning efficiency. For example, BLIP-Diffusion takes 40-120 fine-tuning steps to specialize for a given subject, achieving up to 20x speedup compared to DreamBooth [9]. We conduct ablation studies using 250K subject representation learning steps. Table 2 shows zero-shot evaluation results. |
| Researcher Affiliation | Industry | Dongxu Li, Junnan Li, Steven C.H. Hoi. Salesforce AI Research. Corresponding authors: {li.d,junnan.li,shoi}@salesforce.com |
| Pseudocode | No | The paper describes algorithms and processes in text and figures, but does not include any formal pseudocode or algorithm blocks. |
| Open Source Code | Yes | https://github.com/salesforce/LAVIS/tree/main/projects/blip-diffusion (a minimal loading sketch appears after the table) |
| Open Datasets | Yes | For multimodal representation learning, we follow BLIP-2 [12] and pretrain the model on 129M image-text pairs, including 115M image-text pairs from LAION [28] with CapFilt [29] captions, COCO [30], Visual Genome [31] and Conceptual Captions [32, 33]. For subject representation learning, we use a subset of 292K images from OpenImages-V6 [22]. |
| Dataset Splits | No | The paper mentions selecting checkpoints based on 'validation prompts' during fine-tuning, but it does not specify explicit dataset splits (e.g., percentages, sample counts) for training, validation, or testing for either the large-scale pre-training or the subject-specific fine-tuning. |
| Hardware Specification | Yes | We fine-tune models on a single A100 (40GB) GPU and select checkpoints manually based on a set of validation prompts. We use a total batch size 16 with a constant learning rate 2e-6 for 500K steps using AdamW [26] optimizer, taking 6 days to finish on 16 A100 40GB GPUs. |
| Software Dependencies | No | The paper mentions using the AdamW [26] optimizer, but does not provide specific version numbers for any software libraries, programming languages, or other dependencies. |
| Experiment Setup | Yes | We use a total batch size 16 with a constant learning rate 2e-6 for 500K steps using AdamW [26] optimizer, taking 6 days to finish on 16 A100 40GB GPUs. For all fine-tuning experiments, we use AdamW [26] optimizer with constant learning rate 5e-6 and no warm-up steps. We use batch size 3, adam beta1 0.9, adam beta2 0.999, adam epsilon 1e-8 and weight decay 0.01. For inference, we use PNDM scheduler [39] for 100 denoising steps. We use a fixed guidance scale 7.5 for all experiments. (See the configuration sketch after the table.) |
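As referenced in the Open Source Code row, a minimal loading sketch for the released model is given below. It assumes the LAVIS package from the linked repository is installed; `load_model_and_preprocess` is LAVIS's standard model-loading entry point, but the `name` and `model_type` strings used here are assumptions, so the exact identifiers should be checked against the project README at the URL above.

```python
# Minimal loading sketch, not code from the paper. Assumes the LAVIS package
# from the linked repository is installed (e.g., `pip install salesforce-lavis`).
import torch
from lavis.models import load_model_and_preprocess

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# The registry name and model type below are assumptions for illustration;
# the project README lists the exact identifiers.
model, vis_processors, txt_processors = load_model_and_preprocess(
    name="blip_diffusion",   # assumed registry name
    model_type="base",       # assumed model type
    is_eval=True,
    device=device,
)
```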
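The Experiment Setup row can likewise be read as a configuration sketch. The numeric values below (learning rate 5e-6, betas 0.9/0.999, epsilon 1e-8, weight decay 0.01, batch size 3, 100 PNDM denoising steps, guidance scale 7.5) come directly from the quoted text; the placeholder `unet` module and the default scheduler configuration are assumptions for illustration, not the authors' training code.

```python
# Sketch of the quoted fine-tuning and inference hyperparameters, not the
# authors' implementation. `unet` is a placeholder for the trainable
# diffusion backbone parameters.
import torch
from diffusers import PNDMScheduler

unet = torch.nn.Linear(4, 4)  # placeholder module standing in for the U-Net

# Subject-specific fine-tuning settings quoted in the table:
optimizer = torch.optim.AdamW(
    unet.parameters(),
    lr=5e-6,             # constant learning rate, no warm-up steps
    betas=(0.9, 0.999),  # adam beta1 / beta2
    eps=1e-8,            # adam epsilon
    weight_decay=0.01,
)
batch_size = 3

# Inference settings quoted in the table. The scheduler is built with default
# config here; in practice the beta schedule would come from the underlying
# Stable Diffusion checkpoint.
scheduler = PNDMScheduler()
scheduler.set_timesteps(100)   # 100 denoising steps
guidance_scale = 7.5           # fixed classifier-free guidance scale
```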