VisMin: Visual Minimal-Change Understanding

Authors: Rabiul Awal, Saba Ahmadi, Le Zhang, Aishwarya Agrawal

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical experiments reveal that current VLMs exhibit notable deficiencies in understanding spatial relationships and counting abilities. Furthermore, leveraging the automated nature of our data creation process, we generate a large-scale training dataset, which we use to finetune CLIP (a foundational VLM) and Idefics2 (a multimodal large language model). Our findings show that both these models benefit significantly from fine-tuning on this data, as evident by marked improvements in fine-grained understanding across a wide range of benchmarks.
Researcher Affiliation | Academia | Mila - Quebec AI Institute, Université de Montréal, {rabiul.awal,le.zhang,aishwarya.agrawal}@mila.quebec
Pseudocode | No | The paper describes its methodology through textual descriptions and diagrams but does not include any pseudocode or algorithm blocks.
Open Source Code | Yes | We release all resources including the benchmark, the training data and the finetuned model checkpoints at https://vismin.net/.
Open Datasets | Yes | For object and attribute edit types, which make use of natural images, the training data is sourced from VSR (images sourced from COCO) and the COCO 2017 training split (118K images), while the benchmark data is sourced from the COCO 2017 validation split (5K images).
Dataset Splits | Yes | For object and attribute edit types, which make use of natural images, the training data is sourced from VSR (images sourced from COCO) and the COCO 2017 training split (118K images), while the benchmark data is sourced from the COCO 2017 validation split (5K images). This ensures benchmark images are unseen during training, maintaining evaluation reliability by community standards. The training dataset has 64,392 samples (37,017 objects, 10,352 attributes, 10,050 counting, 6,973 relations), while the VisMin benchmark has 2,084 samples (579 objects, 294 attributes, 589 counting, 622 relations). (See the split-count sketch below the table.)
Hardware Specification | Yes | We use a total batch size of 128 with 4 A100 GPUs and retain other training protocols and hyperparameters as default from OpenCLIP [3].
Software Dependencies | No | We use a total batch size of 128 with 4 A100 GPUs and retain other training protocols and hyperparameters as default from OpenCLIP [3], including a learning rate of 1e-05, weight decay of 0.2, Adam β1 of 0.9, β2 of 0.98, an eps of 1e-06, and a cosine scheduler.
Experiment Setup | Yes | We use a total batch size of 128 with 4 A100 GPUs and retain other training protocols and hyperparameters as default from OpenCLIP [3], including a learning rate of 1e-05, weight decay of 0.2, Adam β1 of 0.9, β2 of 0.98, an eps of 1e-06, and a cosine scheduler. The training runs for 5 epochs, and we select checkpoints based on a separate VisMin validation set. We use 4-bit quantization, with r = 64 and α = 16 for LoRA, and a learning rate of 1e-5. The model was fine-tuned for one epoch with an accumulated batch size of 64. (See the fine-tuning sketch below the table.)
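The split counts quoted in the Dataset Splits row add up as stated. The short Python sketch below simply restates them as data and checks the totals; the variable names are illustrative and are not code from the VisMin release.

```python
# Split sizes as reported in the paper; names are illustrative, not the authors' code.
TRAIN_COUNTS = {"object": 37_017, "attribute": 10_352, "counting": 10_050, "relation": 6_973}
BENCH_COUNTS = {"object": 579, "attribute": 294, "counting": 589, "relation": 622}

assert sum(TRAIN_COUNTS.values()) == 64_392  # training set total reported in the paper
assert sum(BENCH_COUNTS.values()) == 2_084   # VisMin benchmark total reported in the paper
```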
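The Experiment Setup row lists the reported hyperparameters for both fine-tuning runs. The sketch below shows how those values could map onto a standard PyTorch/PEFT configuration; it is an approximation under stated assumptions, not the authors' training code. In particular, the LoRA target modules, dropout, 4-bit quantization type, and compute dtype are not given in the quote and are marked as assumptions, and OpenCLIP's own cosine schedule (with warmup) is approximated here by `CosineAnnealingLR`.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR
from peft import LoraConfig
from transformers import BitsAndBytesConfig

# --- CLIP fine-tuning: reported OpenCLIP defaults ---
# Total batch size 128 across 4 A100 GPUs (32 per device), 5 epochs.
def make_clip_optimizer(model, steps_per_epoch, epochs=5):
    optimizer = AdamW(
        model.parameters(),
        lr=1e-5,            # reported learning rate
        betas=(0.9, 0.98),  # reported Adam beta1 / beta2
        eps=1e-6,           # reported eps
        weight_decay=0.2,   # reported weight decay
    )
    # Plain cosine annealing as a stand-in for OpenCLIP's default cosine schedule.
    scheduler = CosineAnnealingLR(optimizer, T_max=steps_per_epoch * epochs)
    return optimizer, scheduler

# --- Idefics2 fine-tuning: 4-bit quantization with LoRA (r=64, alpha=16), lr 1e-5, 1 epoch ---
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit quantization as reported
    bnb_4bit_quant_type="nf4",              # assumption: quantization type not stated
    bnb_4bit_compute_dtype=torch.bfloat16,  # assumption: compute dtype not stated
)
lora_config = LoraConfig(
    r=64,                                   # reported LoRA rank
    lora_alpha=16,                          # reported LoRA alpha
    lora_dropout=0.05,                      # assumption: dropout not stated
    target_modules=["q_proj", "v_proj"],    # assumption: target modules not stated
    task_type="CAUSAL_LM",
)
# The accumulated batch size of 64 could be reached, e.g., with a per-device
# batch of 8 and 8 gradient-accumulation steps (an illustrative split only).
```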