Dense and Aligned Captions (DAC) Promote Compositional Reasoning in VL Models

Authors: Sivan Doveh, Assaf Arbelle, Sivan Harary, Roei Herzig, Donghyun Kim, Paola Cascante-Bonilla, Amit Alfassy, Rameswar Panda, Raja Giryes, Rogerio Feris, Shimon Ullman, Leonid Karlinsky

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Applied to CLIP, we demonstrate a significant compositional reasoning performance increase of up to 27% over the base model, up to 20% over the strongest baseline, and 6.7% on average.
Researcher Affiliation | Collaboration | IBM Research, Weizmann Institute of Science, Tel-Aviv University, MIT-IBM Watson AI Lab, Technion, Korea University, Rice University
Pseudocode | No | The paper describes the method flow in text and with a diagram (Figure 2), but does not include any explicit pseudocode or algorithm blocks.
Open Source Code | No | Our code is provided in the Supplementary, and it will be released upon acceptance together with our trained weights.
Open Datasets | Yes | We use the Conceptual Captions 3M (CC3M) dataset [75] to finetune CLIP...
Dataset Splits | No | The paper finetunes on CC3M and evaluates on the VL-Checklist, ARO, and Elevater benchmarks, but does not explicitly state the training/validation/test splits used for CC3M, nor for the evaluation datasets beyond noting that they are benchmarks.
Hardware Specification | Yes | We used 6 V100 GPUs for 12 hours to train a model.
Software Dependencies | No | The paper mentions software components such as PyTorch, the LAVIS implementation of BLIP-2, the OPT 6.7B LLM, the ViT-H SAM model, and the GPT-Neo 2.7B LLM, but does not provide version numbers for these dependencies (e.g., the PyTorch version). A hedged loading sketch follows the table.
Experiment Setup | Yes | During training, we set the batch size to 128 when training without density expansion (for ablations) and to 32 with density expansions. We set the learning rate to 5.0e-4 and use the AdamW optimizer over 5 epochs, initializing with the CLIP weights. A hedged training-configuration sketch follows the table.
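
The Software Dependencies row lists the main components without versions. Below is a minimal, hedged sketch of how the BLIP-2 captioner could be loaded through LAVIS and used to generate captions for an image; the package versions, checkpoint name ("caption_coco_opt6.7b"), and the file "example.jpg" are assumptions for illustration, not the authors' documented setup, and the SAM and GPT-Neo components mentioned in the paper are omitted here for brevity.

```python
# Hedged loading sketch; the paper does not pin versions, so the packages and
# checkpoint names below are assumptions rather than the authors' exact setup.
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess  # salesforce-lavis

device = "cuda" if torch.cuda.is_available() else "cpu"

# BLIP-2 with the OPT-6.7B language model, used as an image captioner.
blip2, vis_processors, _ = load_model_and_preprocess(
    name="blip2_opt",
    model_type="caption_coco_opt6.7b",  # assumed checkpoint variant
    is_eval=True,
    device=device,
)

# Caption a single image (path is a placeholder).
image = Image.open("example.jpg").convert("RGB")
batch = vis_processors["eval"](image).unsqueeze(0).to(device)
captions = blip2.generate(
    {"image": batch}, use_nucleus_sampling=True, num_captions=5
)
print(captions)
```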
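The Experiment Setup row quotes the reported hyperparameters. The sketch below arranges those values (batch size 32 with density expansion or 128 for ablations, AdamW, learning rate 5.0e-4, 5 epochs, CLIP initialization) into a standard CLIP fine-tuning loop. The OpenCLIP backbone choice, the `train_loader`, and the plain contrastive loss are assumptions standing in for the paper's full training objective, which this sketch does not reproduce.

```python
# Hedged sketch of the reported fine-tuning setup; everything beyond the quoted
# hyperparameters (backbone, data loader, loss) is an assumption.
import torch
import torch.nn.functional as F
import open_clip  # assumed CLIP implementation, initialized from pretrained weights

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="openai"  # assumed architecture/checkpoint
)
model = model.cuda()

optimizer = torch.optim.AdamW(model.parameters(), lr=5.0e-4)  # reported lr/optimizer
num_epochs = 5        # reported
batch_size = 32       # reported; 128 when training without density expansion

# `train_loader` yielding (images, tokenized_texts) batches is assumed to exist.
for epoch in range(num_epochs):
    for images, texts in train_loader:
        images, texts = images.cuda(), texts.cuda()
        image_feat, text_feat, logit_scale = model(images, texts)
        logits = logit_scale * image_feat @ text_feat.t()
        labels = torch.arange(len(images), device=logits.device)
        # Symmetric image-text contrastive loss (stand-in for the paper's losses).
        loss = (F.cross_entropy(logits, labels)
                + F.cross_entropy(logits.t(), labels)) / 2
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```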