Dense and Aligned Captions (DAC) Promote Compositional Reasoning in VL Models

Authors: Sivan Doveh, Assaf Arbelle, Sivan Harary, Roei Herzig, Donghyun Kim, Paola Cascante-Bonilla, Amit Alfassy, Rameswar Panda, Raja Giryes, Rogerio Feris, Shimon Ullman, Leonid Karlinsky

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Applied to CLIP, we demonstrate a significant compositional reasoning performance increase of up to 27% over the base model, up to 20% over the strongest baseline, and 6.7% on average.
Researcher Affiliation | Collaboration | IBM Research, Weizmann Institute of Science, Tel-Aviv University, MIT-IBM Watson AI Lab, Technion, Korea University, Rice University
Pseudocode | No | The paper describes the method flow in text and with a diagram (Figure 2), but does not include any explicit pseudocode or algorithm blocks.
Open Source Code | No | Our code is provided in the Supplementary, and it will be released upon acceptance together with our trained weights.
Open Datasets | Yes | We use the Conceptual Captions 3M (CC3M) dataset [75] to finetune CLIP...
Dataset Splits | No | The paper finetunes on CC3M and evaluates on the VL-Checklist, ARO, and Elevater benchmarks, but does not explicitly state the training/validation/test splits used for CC3M, nor for the evaluation datasets beyond noting that they are benchmarks.
Hardware Specification | Yes | We used 6 V100 GPUs for 12 hours to train a model.
Software Dependencies | No | The paper mentions software components such as PyTorch, the LAVIS implementation of BLIP-2, the OPT 6.7B LLM, the ViT-H SAM model, and the GPT-Neo 2.7B LLM, but does not provide version numbers for these dependencies (e.g., the PyTorch version). A hedged loading sketch follows the table.
Experiment Setup | Yes | During training, we set the batch size to 128 when training without density expansion (for ablations) and to 32 with density expansions. We set the learning rate to 5.0e-4 and use the AdamW optimizer over 5 epochs, initializing with the CLIP weights. A hedged training-configuration sketch follows the table.
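
The Software Dependencies row lists the main components without versions. Below is a minimal, hedged sketch of how the BLIP-2 captioner could be loaded through LAVIS and used to generate captions for an image; the package versions, checkpoint name ("caption_coco_opt6.7b"), and the file "example.jpg" are assumptions for illustration, not the authors' documented setup, and the SAM and GPT-Neo components mentioned in the paper are omitted here for brevity.

```python
# Hedged loading sketch; the paper does not pin versions, so the packages and
# checkpoint names below are assumptions rather than the authors' exact setup.
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess  # salesforce-lavis

device = "cuda" if torch.cuda.is_available() else "cpu"

# BLIP-2 with the OPT-6.7B language model, used as an image captioner.
blip2, vis_processors, _ = load_model_and_preprocess(
    name="blip2_opt",
    model_type="caption_coco_opt6.7b",  # assumed checkpoint variant
    is_eval=True,
    device=device,
)

# Caption a single image (path is a placeholder).
image = Image.open("example.jpg").convert("RGB")
batch = vis_processors["eval"](image).unsqueeze(0).to(device)
captions = blip2.generate(
    {"image": batch}, use_nucleus_sampling=True, num_captions=5
)
print(captions)
```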
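The Experiment Setup row quotes the reported hyperparameters. The sketch below arranges those values (batch size 32 with density expansion or 128 for ablations, AdamW, learning rate 5.0e-4, 5 epochs, CLIP initialization) into a standard CLIP fine-tuning loop. The OpenCLIP backbone choice, the `train_loader`, and the plain contrastive loss are assumptions standing in for the paper's full training objective, which this sketch does not reproduce.

```python
# Hedged sketch of the reported fine-tuning setup; everything beyond the quoted
# hyperparameters (backbone, data loader, loss) is an assumption.
import torch
import torch.nn.functional as F
import open_clip  # assumed CLIP implementation, initialized from pretrained weights

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="openai"  # assumed architecture/checkpoint
)
model = model.cuda()

optimizer = torch.optim.AdamW(model.parameters(), lr=5.0e-4)  # reported lr/optimizer
num_epochs = 5        # reported
batch_size = 32       # reported; 128 when training without density expansion

# `train_loader` yielding (images, tokenized_texts) batches is assumed to exist.
for epoch in range(num_epochs):
    for images, texts in train_loader:
        images, texts = images.cuda(), texts.cuda()
        image_feat, text_feat, logit_scale = model(images, texts)
        logits = logit_scale * image_feat @ text_feat.t()
        labels = torch.arange(len(images), device=logits.device)
        # Symmetric image-text contrastive loss (stand-in for the paper's losses).
        loss = (F.cross_entropy(logits, labels)
                + F.cross_entropy(logits.t(), labels)) / 2
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```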