Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
Authors: Peter Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Adithya Jairam Vedagiri IYER, Sai Charitha Akula, Shusheng Yang, Jihan Yang, Manoj Middepogu, Ziteng Wang, Xichen Pan, Rob Fergus, Yann LeCun, Saining Xie
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our study uses LLMs and visual instruction tuning as an interface to evaluate various visual representations, offering new insights into different models and architectures (self-supervised, strongly supervised, or combinations thereof) based on experiments with over 20 vision encoders. We critically examine existing MLLM benchmarks, addressing the difficulties involved in consolidating and interpreting results from various tasks, and introduce a new vision-centric benchmark, CV-Bench. |
| Researcher Affiliation | Academia | Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai Charitha Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, Austin Wang, Rob Fergus, Yann LeCun, Saining Xie (New York University) |
| Pseudocode | No | The paper provides mathematical formulations and descriptions of processes, particularly in Section 3 regarding the Spatial Vision Aggregator, but it does not include explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | We provide model weights, code, supporting tools, datasets, and detailed instruction-tuning and evaluation recipes. Project page: https://cambrian-mllm.github.io/ |
| Open Datasets | Yes | By repurposing standard vision benchmarks [18, 79, 154]... For our experiments, we tune a set of MLLMs using Vicuna-1.5-7B as the LLM backbone and each of our 23 vision models (Table 6) as the visual encoder. We use a 737K instruction tuning data mix for all experiments here (see Appendix H). ... We create a large pool of instruction tuning data, which we refer to as Cambrian-10M. |
| Dataset Splits | No | The paper describes using various instruction tuning data mixes (e.g., 737K, 5M, Cambrian-7M) for training and evaluates on existing benchmarks, but it does not explicitly define a validation split of its own instruction tuning datasets (Cambrian-10M or Cambrian-7M) that is distinct from the training and test data. |
| Hardware Specification | Yes | All models in this paper were trained using TPU-V4 pods [60]; we evaluate using NVIDIA A6000, A100, and H100 cards. |
| Software Dependencies | No | The paper mentions key software components like 'Torch XLA with FSDP [150]' and 'Hugging Face Transformers & Accelerate'. However, it does not specify explicit version numbers for these software packages, which is necessary for full reproducibility. A sketch for recording such versions appears after the table. |
| Experiment Setup | Yes | For our experiments, we tune a set of MLLMs using Vicuna-1.5-7B as the LLM backbone and each of our 23 vision models (Table 6) as the visual encoder. We use a 737K instruction tuning data mix for all experiments here (see Appendix H). All hyperparameters are matched across each experimental setting, highlighting the impact of different tuning strategies with each visual encoder. All experimental settings and results are tabulated in Appendix F.2. ... Table 23: Implementation details and hyperparameters for all experiments. A configuration sketch of this controlled sweep appears after the table. |
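
The Experiment Setup row describes a controlled sweep: the LLM backbone (Vicuna-1.5-7B), the 737K instruction tuning mix, and the hyperparameters are held fixed while the vision encoder varies across 23 models. The sketch below illustrates that design only; the encoder identifiers, hyperparameter values, and helper structure are placeholders and are not taken from the paper or its released code.

```python
from dataclasses import dataclass

# Placeholder encoder identifiers; the paper compares 23 encoders (its Table 6),
# which are not reproduced here.
VISION_ENCODERS = [
    "example/clip-like-encoder",
    "example/self-supervised-encoder",
]

@dataclass
class RunConfig:
    llm_backbone: str
    vision_encoder: str
    instruction_data: str
    learning_rate: float = 2e-5   # placeholder value, not from the paper
    batch_size: int = 128         # placeholder value, not from the paper

def controlled_sweep() -> list[RunConfig]:
    """One run per vision encoder; every other field is matched across runs."""
    return [
        RunConfig(
            llm_backbone="Vicuna-1.5-7B",
            vision_encoder=encoder,
            instruction_data="737K instruction-tuning mix",
        )
        for encoder in VISION_ENCODERS
    ]

if __name__ == "__main__":
    for cfg in controlled_sweep():
        print(cfg)
```

Holding everything but the encoder fixed is what lets the paper attribute benchmark differences to the visual representation rather than to tuning choices.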
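
The Software Dependencies row notes that library versions are not reported. One minimal way to close that gap when rerunning the released code is to record the installed versions of the libraries the paper names (Torch XLA, Transformers, Accelerate); the package list below is an assumption based on those mentions, not a list taken from the paper.

```python
# Record installed versions of the named libraries so a rerun can pin the same
# environment. Package names are assumed PyPI distribution names.
from importlib import metadata

PACKAGES = ["torch", "torch-xla", "transformers", "accelerate"]

for name in PACKAGES:
    try:
        print(f"{name}=={metadata.version(name)}")
    except metadata.PackageNotFoundError:
        print(f"{name}: not installed")
```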