Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models

Authors: Siddharth Karamcheti, Suraj Nair, Ashwin Balakrishna, Percy Liang, Thomas Kollar, Dorsa Sadigh

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To address these gaps, we first compile a suite of standardized evaluations spanning visual question answering, object localization, and challenge sets that probe properties such as hallucination; evaluations that provide fine-grained insight into VLM capabilities. Second, we rigorously investigate VLMs along key design axes...
Researcher Affiliation | Collaboration | 1. Department of Computer Science, Stanford University, Stanford, CA, USA; 2. Toyota Research Institute, Los Altos, CA, USA.
Pseudocode | No | The paper describes procedures in text but does not contain a dedicated pseudocode or algorithm block.
Open Source Code | Yes | We release our optimized training codebase, evaluation suite, and checkpoints for all models trained as part of this work. github.com/TRI-ML/prismatic-vlms github.com/TRI-ML/vlm-evaluation
Open Datasets | Yes | Specifically, we use the LLaVA v1.5 data mixture, which consists of two subsets used for a multi-stage training pipeline. The first subset consists of a 558K sample mixture of examples sourced from various captioning datasets (e.g., Conceptual Captions, LAION; Sharma et al., 2018; Schuhmann et al., 2021), while the second consists of 665K multimodal instruct tuning examples comprised of synthetic data generated in Liu et al. (2023c), as well as examples from existing vision-language training sets (e.g., GQA, TextCaps; Hudson & Manning, 2019; Sidorov et al., 2020), and notably, a sample of language-only data from ShareGPT (ShareGPT, 2023).
Dataset Splits | Yes | We use the validation sets for all benchmarks except GQA (where we use the recommended test-dev split), VSR (where we use the zero-shot test split), and POPE (where there is only a single evaluation split).
Hardware Specification | Yes | ...when benchmarked on the same hardware (an AWS p4de.24xlarge node with 8 A100 GPUs).
Software Dependencies | No | We implement our training codebase in PyTorch, using Fully Sharded Data Parallel (FSDP; Zhao et al., 2023) and BF16 mixed precision. We leverage TIMM (Wightman, 2019) and Hugging Face Transformers (Wolf et al., 2019) to provide pretrained models. While these software components are mentioned, specific version numbers for them are not provided.
Experiment Setup | Yes | Table 1 (Training Hyperparameters): Batch Size 128; Max Gradient Norm 1.0; Weight Decay 0.1; Learning Rate 2e-5; Optimizer AdamW; Scheduler Warmup & Cosine Decay; Warmup Ratio 0.03.
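
The sketch below is not the authors' released training code; it is a minimal illustration of how the Table 1 hyperparameters could be wired into a PyTorch AdamW optimizer with a linear-warmup-then-cosine-decay schedule. The placeholder model and the total step count are assumptions made purely for the example (the real values depend on the data mixture and number of epochs).

```python
# Minimal sketch, assuming a generic PyTorch training setup (not the Prismatic codebase).
import math
import torch

model = torch.nn.Linear(10, 10)  # placeholder standing in for the VLM

BATCH_SIZE = 128          # Table 1: Batch Size
MAX_GRAD_NORM = 1.0       # Table 1: Max Gradient Norm
WEIGHT_DECAY = 0.1        # Table 1: Weight Decay
LEARNING_RATE = 2e-5      # Table 1: Learning Rate
WARMUP_RATIO = 0.03       # Table 1: Warmup Ratio
TOTAL_STEPS = 10_000      # assumed; depends on dataset size and epochs
WARMUP_STEPS = int(WARMUP_RATIO * TOTAL_STEPS)

optimizer = torch.optim.AdamW(
    model.parameters(), lr=LEARNING_RATE, weight_decay=WEIGHT_DECAY
)

def warmup_cosine(step: int) -> float:
    """Linear warmup followed by cosine decay to zero, as a multiplier on the base LR."""
    if step < WARMUP_STEPS:
        return step / max(1, WARMUP_STEPS)
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_cosine)

# Inside the training loop, gradients would be clipped before each optimizer step:
# torch.nn.utils.clip_grad_norm_(model.parameters(), MAX_GRAD_NORM)
# optimizer.step(); scheduler.step(); optimizer.zero_grad()
```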