Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Discffusion: Discriminative Diffusion Models as Few-shot Vision and Language Learners
Authors: Xuehai He, Weixi Feng, Tsu-Jui Fu, Varun Jampani, Arjun Reddy Akula, Pradyumna Narayana, S Basu, William Yang Wang, Xin Eric Wang
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | By comparing Discffusion with state-of-the-art methods on several benchmark datasets, we demonstrate the potential of using pre-trained diffusion models for discriminative tasks with superior results on few-shot image-text matching. The paper contains a dedicated "5 Experiments" section detailing evaluations on multiple datasets, comparisons with baselines, and ablation studies, indicating empirical validation. |
| Researcher Affiliation | Collaboration | UC Santa Cruz, UC Santa Barbara, Stability AI, Google. The affiliations include both academic institutions (UC Santa Cruz, UC Santa Barbara) and industry companies (Stability AI, Google). |
| Pseudocode | Yes | The overall algorithm is shown in Algorithm 2. The paper provides "Algorithm 1: Discffusion Training" and "Algorithm 2: Discffusion Inference". |
| Open Source Code | No | The paper mentions using third-party libraries such as the "Accelerate library" and "Hugging Face Diffusers" but does not provide any explicit statement or link to the authors' own source code for Discffusion. |
| Open Datasets | Yes | We use the Compositional Visual Genome (ComVG) (Krishna et al., 2017) and RefCOCOg (Yu et al., 2016) datasets to do image-text matching. Additionally, we include the VQAv2 dataset (Antol et al., 2015). Winoground (Thrush et al., 2022) and VL-checklist (Zhao et al., 2022) are also included. The LAION dataset (Schuhmann et al., 2022) is used for pre-training, along with the MS-COCO dataset (Lin et al., 2014). |
| Dataset Splits | Yes | We then test Discffusion under the setting where we train the model with only 5% of the dataset (Yoo et al., 2021), demonstrating its adaptation capability using limited data. We have expanded our experimentation to extreme few-shot learning by conducting tests with only 0.5% of training data (27 examples from ComVG). |
| Hardware Specification | Yes | The inference was executed in a distributed manner on an NVIDIA workstation equipped with 4 A6000 GPUs. Remarkably, it requires only a single NVIDIA V100 GPU for training. |
| Software Dependencies | No | The paper mentions using "Stable Diffusion v2.1-base with the xFormers (Lefaudeux et al., 2022) and flash attention (Dao et al., 2022) implementation" as well as the "Accelerate library" and "Hugging Face Diffusers", but does not provide specific version numbers for these software components or libraries. |
| Experiment Setup | Yes | M is a predefined margin where we use 0.2 in our experiments. We use Stable Diffusion v2.1-base. On the Ref COCOg dataset, we sample 10 text prompts from the pool each time. The sampling was carried out using the DDIM (Song et al., 2020) method with a total of 50 steps. We set the noise level to {0.2, 0.4, 0.6, 0.8}. |
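The margin M = 0.2 cited in the Experiment Setup row suggests a margin-based ranking objective for image-text matching. The exact loss used by the paper is not quoted here, so the hinge formulation below is only an illustrative sketch under that assumption, not the authors' verified implementation:

```python
# Hypothetical sketch of a margin-based matching loss. The paper states
# M = 0.2; the specific hinge form and the score inputs are assumptions.

def margin_ranking_loss(pos_score: float, neg_score: float, margin: float = 0.2) -> float:
    """Penalize cases where the matching (positive) image-text pair does not
    outscore a mismatched (negative) pair by at least `margin`."""
    return max(0.0, margin - pos_score + neg_score)

# Positive pair outscores the negative by 0.4 (>= margin), so no penalty:
print(margin_ranking_loss(0.9, 0.5))  # 0.0
# Gap of only 0.1 (< margin 0.2) incurs a small loss (~0.1):
print(margin_ranking_loss(0.6, 0.5))
```

With a gap larger than the margin the loss vanishes; otherwise the loss grows linearly with the shortfall, which is the standard behavior of a hinge ranking loss.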