Dense and Aligned Captions (DAC) Promote Compositional Reasoning in VL Models
Authors: Sivan Doveh, Assaf Arbelle, Sivan Harary, Roei Herzig, Donghyun Kim, Paola Cascante-Bonilla, Amit Alfassy, Rameswar Panda, Raja Giryes, Rogerio Feris, Shimon Ullman, Leonid Karlinsky
NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Applied to CLIP, we demonstrate its significant compositional reasoning performance increase of up to 27% over the base model, up to 20% over the strongest baseline, and by 6.7% on average. |
| Researcher Affiliation | Collaboration | 1IBM Research, 2Weizmann Institute of Science, 3Tel-Aviv University, 4MIT-IBM Watson AI Lab, 5Technion, 6Korea University, 7Rice University |
| Pseudocode | No | The paper describes the method flow in text and with a diagram (Figure 2), but does not include any explicit pseudocode or algorithm blocks. |
| Open Source Code | No | Our code is provided in the Supplementary, and it will be released upon acceptance together with our trained weights. |
| Open Datasets | Yes | We use the Conceptual Captions 3M (CC3M) dataset [75] to finetune CLIP... |
| Dataset Splits | No | The paper finetunes on the CC3M dataset and evaluates on the VL-Checklist, ARO, and Elevater benchmarks, but it does not explicitly provide training/validation/test splits for CC3M, nor for the evaluation datasets beyond identifying them as benchmarks. |
| Hardware Specification | Yes | We used 6 v100 GPUs for 12 hours to train a model. |
| Software Dependencies | No | The paper mentions software components such as PyTorch, LAVIS implementation of BLIP2, OPT 6.7B LLM, ViT-H SAM model, and GPT-NEO-2.7B LLM, but does not provide specific version numbers for these software dependencies (e.g., PyTorch version). |
| Experiment Setup | Yes | During training, we set the batch size to 128 when training without density expansion (for ablations) and to 32 with density expansions. We set the learning rate to 5.0e-4 and use the AdamW optimizer over 5 epochs, initializing with the CLIP weights. |
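
The reported hyperparameters can be pieced together into a minimal fine-tuning sketch. This is not the authors' released code; it assumes the `open_clip` package, a standard CLIP contrastive objective, and a hypothetical `train_loader` yielding batches of 32 preprocessed images with their dense/aligned captions (e.g., derived from CC3M).

```python
# Hypothetical reconstruction of the reported setup: batch size 32 (with density
# expansions), lr 5.0e-4, AdamW, 5 epochs, initialized from pretrained CLIP weights.
import torch
import torch.nn.functional as F
import open_clip

device = "cuda" if torch.cuda.is_available() else "cpu"

# Initialize from pretrained CLIP weights, as reported in the paper.
model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model = model.to(device)

optimizer = torch.optim.AdamW(model.parameters(), lr=5.0e-4)

def contrastive_step(images, captions):
    """One standard CLIP contrastive update on a batch of (image, caption) pairs."""
    image_features = F.normalize(model.encode_image(images.to(device)), dim=-1)
    text_tokens = tokenizer(captions).to(device)
    text_features = F.normalize(model.encode_text(text_tokens), dim=-1)

    # Symmetric image-to-text / text-to-image cross-entropy loss.
    logits = model.logit_scale.exp() * image_features @ text_features.t()
    labels = torch.arange(len(images), device=device)
    loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# train_loader (not shown) is assumed to yield (images, captions) batches of size 32.
# for epoch in range(5):
#     for images, captions in train_loader:
#         contrastive_step(images, captions)
```

Note that this sketch only reflects the optimizer, learning rate, epoch count, and initialization stated in the table; the paper's caption-quality and caption-density components (e.g., BLIP2-generated and LLM-expanded captions, SAM-based dense crops) would replace the placeholder data loader.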