Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models

Authors: Siddharth Karamcheti, Suraj Nair, Ashwin Balakrishna, Percy Liang, Thomas Kollar, Dorsa Sadigh

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To address these gaps, we first compile a suite of standardized evaluations spanning visual question answering, object localization, and challenge sets that probe properties such as hallucination; evaluations that provide fine-grained insight into VLM capabilities. Second, we rigorously investigate VLMs along key design axes...
Researcher Affiliation | Collaboration | 1. Department of Computer Science, Stanford University, Stanford, CA, USA; 2. Toyota Research Institute, Los Altos, CA, USA.
Pseudocode | No | The paper describes procedures in text but does not contain a dedicated pseudocode or algorithm block.
Open Source Code | Yes | We release our optimized training codebase, evaluation suite, and checkpoints for all models trained as part of this work. github.com/TRI-ML/prismatic-vlms github.com/TRI-ML/vlm-evaluation
Open Datasets | Yes | Specifically, we use the LLaVA v1.5 data mixture, which consists of two subsets used for a multi-stage training pipeline. The first subset consists of a 558K sample mixture of examples sourced from various captioning datasets (e.g., Conceptual Captions, LAION; Sharma et al., 2018; Schuhmann et al., 2021), while the second consists of 665K multimodal instruct tuning examples comprised of synthetic data generated in Liu et al. (2023c), as well as examples from existing vision-language training sets (e.g., GQA, TextCaps; Hudson & Manning, 2019; Sidorov et al., 2020), and notably, a sample of language-only data from ShareGPT (ShareGPT, 2023).
Dataset Splits | Yes | We use the validation sets for all benchmarks except GQA (where we use the recommended test-dev split), VSR (where we use the zero-shot test split), and POPE (where there is only a single evaluation split).
Hardware Specification | Yes | ...when benchmarked on the same hardware (an AWS p4de.24xlarge node with 8 A100 GPUs).
Software Dependencies | No | We implement our training codebase in PyTorch, using Fully Sharded Data Parallel (FSDP; Zhao et al., 2023) and BF16 mixed precision. We leverage TIMM (Wightman, 2019) and Hugging Face Transformers (Wolf et al., 2019) to provide pretrained models. While these software components are mentioned, specific version numbers for them are not provided.
Experiment Setup | Yes | Table 1 (Training Hyperparameters): Batch Size 128; Max Gradient Norm 1.0; Weight Decay 0.1; Learning Rate 2e-5; Optimizer AdamW; Scheduler Warmup & Cosine Decay; Warmup Ratio 0.03.
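
The sketch below is not the authors' released training code; it is a minimal illustration of how the Table 1 hyperparameters could be wired into a PyTorch AdamW optimizer with a linear-warmup-then-cosine-decay schedule. The placeholder model and the total step count are assumptions made purely for the example (the real values depend on the data mixture and number of epochs).

```python
# Minimal sketch, assuming a generic PyTorch training setup (not the Prismatic codebase).
import math
import torch

model = torch.nn.Linear(10, 10)  # placeholder standing in for the VLM

BATCH_SIZE = 128          # Table 1: Batch Size
MAX_GRAD_NORM = 1.0       # Table 1: Max Gradient Norm
WEIGHT_DECAY = 0.1        # Table 1: Weight Decay
LEARNING_RATE = 2e-5      # Table 1: Learning Rate
WARMUP_RATIO = 0.03       # Table 1: Warmup Ratio
TOTAL_STEPS = 10_000      # assumed; depends on dataset size and epochs
WARMUP_STEPS = int(WARMUP_RATIO * TOTAL_STEPS)

optimizer = torch.optim.AdamW(
    model.parameters(), lr=LEARNING_RATE, weight_decay=WEIGHT_DECAY
)

def warmup_cosine(step: int) -> float:
    """Linear warmup followed by cosine decay to zero, as a multiplier on the base LR."""
    if step < WARMUP_STEPS:
        return step / max(1, WARMUP_STEPS)
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_cosine)

# Inside the training loop, gradients would be clipped before each optimizer step:
# torch.nn.utils.clip_grad_norm_(model.parameters(), MAX_GRAD_NORM)
# optimizer.step(); scheduler.step(); optimizer.zero_grad()
```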