Are Diffusion Models Vision-And-Language Reasoners?
Authors: Benno Krojer, Elinor Poole-Dayan, Vikram Voleti, Chris Pal, Siva Reddy
NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | we evaluate language-conditioned generative image models on discriminative tasks to shed light on their fine-grained understanding of vision and language. A generative objective trains a model to understand how various objects and parts compose together, and often brings non-trivial emergent capabilities with it such as latent interpolation of composite concepts [Brock et al., 2019, Rombach et al., 2022]. On the other hand, discriminative vision-and-language models need only focus on the minimal information required to solve their discriminative task, which could often be spurious correlations that don't generalize [Agrawal et al., 2016]. |
| Researcher Affiliation | Collaboration | Benno Krojer (Mila & McGill University, benno.krojer@mila.quebec); Elinor Poole-Dayan (McGill University); Vikram Voleti (Mila & University of Montréal; Stability AI); Christopher Pal (Mila & Polytechnique Montréal; Canada CIFAR AI Chair; ServiceNow Research); Siva Reddy (Mila & McGill University; Facebook CIFAR AI Chair; ServiceNow Research) |
| Pseudocode | No | The paper does not contain any sections or figures explicitly labeled as "Pseudocode" or "Algorithm". It describes methods using mathematical equations and prose. |
| Open Source Code | Yes | We are releasing code and benchmark setup: https://github.com/McGill-NLP/diffusion-itm |
| Open Datasets | Yes | We fine-tune on the MS-COCO (109K examples) with the standard diffusion objective (see Equation 1). |
| Dataset Splits | No | The paper mentions "We select a checkpoint after 8 epochs based on hard negative validation" and "Flickr30K Text Retrieval validation takes 68 minutes on a single NVIDIA RTX A6000 GPU". While validation is used, specific numerical splits (e.g., 80/10/10) or counts for training and validation sets are not provided for any dataset. |
| Hardware Specification | Yes | Runtime: With 10 noise samples per image-text pair, evaluation on Flickr30K Text Retrieval validation takes 68 minutes on a single NVIDIA RTX A6000 GPU (compared to around 4 minutes with OpenCLIP ViT-L/14). |
| Software Dependencies | No | The paper mentions software like "Huggingface Diffusers library" but does not provide specific version numbers for any software components, programming languages, or libraries. |
| Experiment Setup | Yes | Timesteps t are sampled uniformly from [0, 1000], guidance scale is kept at 0... Instead we keep samples constant at 250 for the main zero-shot experiments in Tab. 1 and reduce it to a much more feasible number of 10 samples for other experiments... We fine-tune on the MS-COCO hard negative training set [Yuksekgonul et al., 2023] with lr = 1e-4, λ = 1.0 and batchsize 112. We select a checkpoint after 8 epochs based on hard negative validation. (Hedged sketches of this scoring procedure and fine-tuning configuration follow the table.) |
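
As a reading aid for the Experiment Setup row, below is a minimal sketch of the zero-shot diffusion image-text matching score it describes: timesteps are sampled uniformly from [0, 1000], guidance scale is 0 (so only the text-conditioned branch is run), and the text-conditioned denoising error is averaged over a fixed number of noise samples (250 in the main zero-shot experiments, 10 elsewhere). Model loading assumes the Huggingface Diffusers library; the model id, image preprocessing, and the function name `diffusion_itm_score` are illustrative assumptions, not the authors' released implementation (see their repository linked above for that).

```python
import torch
from diffusers import StableDiffusionPipeline

# Load Stable Diffusion components via the Huggingface Diffusers library.
# The model id is an assumption; the paper does not pin versions.
pipe = StableDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-1").to("cuda")
vae, unet, text_encoder, tokenizer = pipe.vae, pipe.unet, pipe.text_encoder, pipe.tokenizer
scheduler = pipe.scheduler  # scheduler.config.num_train_timesteps == 1000

@torch.no_grad()
def diffusion_itm_score(image, caption, n_samples=10):
    """Average denoising error for one image-caption pair (lower = better match).

    `image` is assumed to be a (1, 3, H, W) float tensor normalized to [-1, 1].
    """
    # Encode the caption; guidance scale 0 means no unconditional branch is needed.
    tokens = tokenizer(caption, padding="max_length", truncation=True,
                       max_length=tokenizer.model_max_length, return_tensors="pt").to("cuda")
    text_emb = text_encoder(tokens.input_ids)[0]

    # Encode the image into the VAE latent space (deterministic mean, scaled).
    latents = vae.encode(image.to("cuda")).latent_dist.mean * vae.config.scaling_factor

    errors = []
    for _ in range(n_samples):
        # Sample a timestep uniformly from [0, 1000) and add matching noise.
        t = torch.randint(0, scheduler.config.num_train_timesteps, (1,), device="cuda")
        noise = torch.randn_like(latents)
        noisy = scheduler.add_noise(latents, noise, t)
        # Text-conditioned noise prediction; squared error is the matching score.
        pred = unet(noisy, t, encoder_hidden_states=text_emb).sample
        errors.append(torch.mean((pred - noise) ** 2).item())
    return sum(errors) / len(errors)
```

In a retrieval or image-text matching setting, the caption (or image) with the lowest averaged error would be ranked as the best match.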
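
The fine-tuning settings quoted in the same row can be collected into a small configuration sketch. The λ-weighted hard-negative term is only named here, not reproduced (the quote does not give its exact form), and the optimizer choice below is an assumption; only the learning rate, λ, batch size, and the epoch/selection rule come from the paper.

```python
import torch

# Settings reported in the paper's fine-tuning quote; everything else
# (optimizer choice, unlisted defaults) is an assumption for illustration.
finetune_config = {
    "train_set": "MS-COCO hard negatives [Yuksekgonul et al., 2023]",
    "objective": "standard diffusion loss + lambda-weighted hard-negative term",
    "learning_rate": 1e-4,
    "lambda": 1.0,        # weight on the hard-negative term
    "batch_size": 112,
    "epochs": 8,          # checkpoint chosen after 8 epochs
    "model_selection": "hard-negative validation performance",
}

def make_optimizer(unet_params):
    # AdamW is assumed; the paper only states lr = 1e-4.
    return torch.optim.AdamW(unet_params, lr=finetune_config["learning_rate"])
```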