Are Diffusion Models Vision-And-Language Reasoners?

Authors: Benno Krojer, Elinor Poole-Dayan, Vikram Voleti, Chris Pal, Siva Reddy

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | we evaluate language-conditioned generative image models on discriminative tasks to shed light on their fine-grained understanding of vision and language. A generative objective trains a model to understand how various objects and parts compose together, and often brings non-trivial emergent capabilities with it, such as latent interpolation of composite concepts [Brock et al., 2019, Rombach et al., 2022]. On the other hand, discriminative vision-and-language models need only focus on the minimal information required to solve their discriminative task, which could often be spurious correlations that don't generalize [Agrawal et al., 2016].
Researcher Affiliation | Collaboration | Benno Krojer (Mila & McGill University, benno.krojer@mila.quebec); Elinor Poole-Dayan (McGill University); Vikram Voleti (Mila & University of Montréal; Stability AI); Christopher Pal (Mila & Polytechnique Montréal; Canada CIFAR AI Chair; ServiceNow Research); Siva Reddy (Mila & McGill University; Facebook CIFAR AI Chair; ServiceNow Research)
Pseudocode | No | The paper does not contain any sections or figures explicitly labeled as "Pseudocode" or "Algorithm". It describes methods using mathematical equations and prose.
Open Source Code | Yes | We are releasing code and benchmark setup (https://github.com/McGill-NLP/diffusion-itm).
Open Datasets | Yes | We fine-tune on MS-COCO (109K examples) with the standard diffusion objective (see Equation 1).
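For reference, the "standard diffusion objective" (Equation 1 in the paper) is the usual text-conditioned noise-prediction loss. A rendering in conventional notation (the symbols below are the standard ones, not copied from the paper):

```latex
\mathcal{L}(\theta) \;=\;
\mathbb{E}_{(x,\,c),\ \epsilon \sim \mathcal{N}(0, I),\ t \sim \mathcal{U}\{1,\dots,T\}}
\Big[\, \big\lVert \epsilon - \epsilon_\theta\big(\underbrace{\sqrt{\bar\alpha_t}\,x + \sqrt{1-\bar\alpha_t}\,\epsilon}_{x_t},\ t,\ c\big) \big\rVert^2 \,\Big]
```

Here $x$ is the (latent) image, $c$ the caption conditioning, and $\epsilon_\theta$ the denoising network; the model is trained to recover the injected noise at a random timestep.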
Dataset Splits | No | The paper mentions "We select a checkpoint after 8 epochs based on hard negative validation" and "Flickr30K Text Retrieval validation takes 68 minutes on a single NVIDIA RTX A6000 GPU". While validation is used, specific numerical splits (e.g., 80/10/10) or counts for training and validation sets are not provided for any dataset.
Hardware Specification | Yes | Runtime: with 10 noise samples per image-text pair, evaluation on Flickr30K Text Retrieval validation takes 68 minutes on a single NVIDIA RTX A6000 GPU (compared to around 4 minutes with OpenCLIP ViT-L/14).
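The runtime figure reflects repeated denoising-error evaluation per image-text pair. A minimal sketch of how such diffusion-based image-text matching scoring typically works, assuming a hypothetical noise predictor `eps_model(x_t, t, text_emb)` and a precomputed `alphas_cumprod` schedule (names are illustrative, not taken from the released code):

```python
import torch

@torch.no_grad()
def diffusion_itm_score(eps_model, image_latent, text_emb, alphas_cumprod,
                        n_samples=10, num_timesteps=1000):
    """Average denoising error of `image_latent` under caption `text_emb`.

    Lower is better: the caption that best matches the image should make
    noise prediction easiest for a text-conditioned diffusion model.
    """
    errors = []
    for _ in range(n_samples):
        t = torch.randint(0, num_timesteps, (1,))       # t ~ U[0, 1000)
        eps = torch.randn_like(image_latent)            # fresh Gaussian noise
        a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)     # cumulative schedule term
        x_t = a_bar.sqrt() * image_latent + (1.0 - a_bar).sqrt() * eps
        eps_hat = eps_model(x_t, t, text_emb)           # conditional prediction
        errors.append(((eps_hat - eps) ** 2).mean().item())
    return sum(errors) / len(errors)
```

For text retrieval, each candidate caption is scored this way and the lowest-error caption is selected; per the quoted setup, the main zero-shot results use 250 noise samples and other experiments use 10, which explains the large runtime gap versus a single CLIP forward pass.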
Software Dependencies | No | The paper mentions software such as the Hugging Face Diffusers library but does not provide specific version numbers for any software components, programming languages, or libraries.
Experiment Setup | Yes | Timesteps t are sampled uniformly from [0, 1000], guidance scale is kept at 0... Instead we keep samples constant at 250 for the main zero-shot experiments in Tab. 1 and reduce it to a much more feasible number of 10 samples for other experiments... We fine-tune on the MS-COCO hard negative training set [Yuksekgonul et al., 2023] with lr = 1e-4, λ = 1.0 and batch size 112. We select a checkpoint after 8 epochs based on hard negative validation.
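A sketch of what one hard-negative fine-tuning step could look like under the quoted hyperparameters (lr = 1e-4, λ = 1.0, batch size 112). The exact way the positive and hard-negative denoising losses are combined is an assumption here; the paper's equations and the released repo (McGill-NLP/diffusion-itm) are authoritative:

```python
import torch

def hard_negative_step(eps_model, optimizer, image_latent, pos_emb, neg_emb,
                       alphas_cumprod, lam=1.0, num_timesteps=1000):
    """One fine-tuning step on a batch of (image, positive caption,
    hard-negative caption) triples, e.g. from the MS-COCO hard negative set."""
    b = image_latent.shape[0]
    t = torch.randint(0, num_timesteps, (b,))            # per-example timestep
    eps = torch.randn_like(image_latent)
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    x_t = a_bar.sqrt() * image_latent + (1.0 - a_bar).sqrt() * eps

    loss_pos = ((eps_model(x_t, t, pos_emb) - eps) ** 2).mean()
    loss_neg = ((eps_model(x_t, t, neg_emb) - eps) ** 2).mean()
    # Assumed combination: reward low denoising error on the true caption
    # and high error on the hard negative, weighted by lambda (λ = 1.0
    # per the quoted setup).
    loss = loss_pos - lam * loss_neg

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The design intuition is that plain diffusion fine-tuning only teaches the model to denoise well on positive pairs; the hard-negative term additionally pushes the denoising error apart between matching and near-miss captions, which is what the discriminative evaluation measures.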