Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs

Authors: Peter Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Adithya Jairam Vedagiri Iyer, Sai Charitha Akula, Shusheng Yang, Jihan Yang, Manoj Middepogu, Ziteng Wang, Xichen Pan, Rob Fergus, Yann LeCun, Saining Xie

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our study uses LLMs and visual instruction tuning as an interface to evaluate various visual representations, offering new insights into different models and architectures (self-supervised, strongly supervised, or combinations thereof) based on experiments with over 20 vision encoders. We critically examine existing MLLM benchmarks, addressing the difficulties involved in consolidating and interpreting results from various tasks, and introduce a new vision-centric benchmark, CV-Bench.
Researcher Affiliation | Academia | Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai Charitha Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, Austin Wang, Rob Fergus, Yann LeCun, Saining Xie; New York University
Pseudocode | No | The paper provides mathematical formulations and descriptions of processes, particularly in Section 3 regarding the Spatial Vision Aggregator, but it does not include explicitly labeled 'Pseudocode' or 'Algorithm' blocks. (An illustrative sketch of this kind of aggregation step follows the table.)
Open Source Code | Yes | We provide model weights, code, supporting tools, datasets, and detailed instruction-tuning and evaluation recipes. Project page: https://cambrian-mllm.github.io/
Open Datasets | Yes | By repurposing standard vision benchmarks [18, 79, 154]... For our experiments, we tune a set of MLLMs using Vicuna-1.5-7B as the LLM backbone and each of our 23 vision models (Table 6) as the visual encoder. We use a 737K instruction tuning data mix for all experiments here (see Appendix H). ... We create a large pool of instruction tuning data, which we refer to as Cambrian-10M.
Dataset Splits | No | The paper describes using various instruction tuning data mixes (e.g., 737K, 5M, Cambrian-7M) for training and evaluates on existing benchmarks, but it does not explicitly define a validation split of its own instruction tuning datasets (Cambrian-10M or Cambrian-7M) that is distinct from the training and test data.
Hardware Specification | Yes | All models in this paper were trained using TPU-V4 pods [60]; we evaluate using NVIDIA A6000, A100, and H100 cards.
Software Dependencies | No | The paper mentions key software components such as 'Torch XLA with FSDP [150]' and 'Hugging Face Transformers & Accelerate'. However, it does not specify version numbers for these packages, which full reproducibility would require. (A generic version-recording snippet follows the table.)
Experiment Setup | Yes | For our experiments, we tune a set of MLLMs using Vicuna-1.5-7B as the LLM backbone and each of our 23 vision models (Table 6) as the visual encoder. We use a 737K instruction tuning data mix for all experiments here (see Appendix H). All hyperparameters are matched across each experimental setting, highlighting the impact of different tuning strategies with each visual encoder. All experimental settings and results are tabulated in Appendix F.2. ... Table 23: Implementation details and hyperparameters for all experiments.
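
On the pseudocode point above: the paper formulates the Spatial Vision Aggregator (SVA) mathematically in Section 3 but provides no algorithm block. The following is a minimal, hedged sketch of an SVA-style step, i.e., learnable queries cross-attending to feature maps from several vision encoders. The module name, dimensions, and the use of a single global cross-attention (rather than the paper's spatially windowed attention interleaved across LLM layers) are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of an SVA-style aggregation step. Hypothetical names and sizes;
# the real SVA restricts each query to a local sub-window of every feature map
# and repeats the aggregation at multiple LLM layers.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpatialVisionAggregatorSketch(nn.Module):
    """Learnable query grid that cross-attends to several vision-encoder feature maps."""

    def __init__(self, d_model=1024, queries_per_side=24, encoder_dims=(1024, 1152)):
        super().__init__()
        self.side = queries_per_side
        # L x L learnable query tokens (illustrative resolution and width).
        self.queries = nn.Parameter(0.02 * torch.randn(self.side * self.side, d_model))
        # One projection per vision encoder to a shared width.
        self.proj = nn.ModuleList([nn.Linear(d, d_model) for d in encoder_dims])
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)

    def forward(self, feature_maps):
        # feature_maps: list of (B, C_k, H_k, W_k) tensors, one per encoder.
        batch = feature_maps[0].shape[0]
        tokens = []
        for proj, fmap in zip(self.proj, feature_maps):
            # Align every encoder's spatial grid with the query grid,
            # then flatten to (B, side*side, d_model).
            fmap = F.interpolate(fmap, size=(self.side, self.side),
                                 mode="bilinear", align_corners=False)
            tokens.append(proj(fmap.flatten(2).transpose(1, 2)))
        keys = torch.cat(tokens, dim=1)                        # (B, K*side*side, d_model)
        queries = self.queries.unsqueeze(0).expand(batch, -1, -1)
        out, _ = self.cross_attn(queries, keys, keys)          # aggregate across encoders
        return out                                             # (B, side*side, d_model) visual tokens


# Usage: two hypothetical encoders with different channel counts and grids.
sva = SpatialVisionAggregatorSketch()
visual_tokens = sva([torch.randn(1, 1024, 27, 27), torch.randn(1, 1152, 24, 24)])
print(visual_tokens.shape)  # torch.Size([1, 576, 1024])
```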
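
On the software-dependency point: one low-effort way to close this gap is to record the exact library versions alongside experiment outputs. The snippet below is a generic sketch, not part of the Cambrian-1 release; the package list mirrors the components the paper names (Torch XLA, Transformers, Accelerate) and relies only on the standard-library importlib.metadata module.

```python
# Record installed package versions so a run's software environment can be
# reproduced later. Generic sketch; the package list is illustrative.
import json
from importlib import metadata

packages = ["torch", "torch_xla", "transformers", "accelerate"]
versions = {}
for pkg in packages:
    try:
        versions[pkg] = metadata.version(pkg)
    except metadata.PackageNotFoundError:
        versions[pkg] = "not installed"

with open("environment_versions.json", "w") as f:
    json.dump(versions, f, indent=2)
print(versions)
```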