Accelerating Pre-training of Multimodal LLMs via Chain-of-Sight
Authors: Ziyuan Huang, Kaixiang Ji, Biao Gong, Zhiwu Qing, Qinglong Zhang, Kecheng Zheng, Jian Wang, Jingdong Chen, Ming Yang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical results on a series of vision-language benchmarks reveal that the pre-training acceleration through Chain-of-Sight is achieved without sacrificing performance, matching or surpassing the standard pipeline of utilizing all visual tokens throughout the entire training process. |
| Researcher Affiliation | Collaboration | Ziyuan Huang¹, Kaixiang Ji¹, Biao Gong¹, Zhiwu Qing², Qinglong Zhang¹, Kecheng Zheng¹, Jian Wang¹, Jingdong Chen¹, Ming Yang¹ (¹Ant Group, ²Huazhong University of Science and Technology) |
| Pseudocode | No | The paper includes figures and equations to describe the framework but does not contain explicit pseudocode or algorithm blocks. |
| Open Source Code | No | The paper provides a general project website URL (https://chain-of-sight.github.io/) but does not include a direct link to a specific source-code repository or an explicit statement confirming the release of the code for the described methodology. |
| Open Datasets | Yes | For the first stage, we sample around 65M image-text data involving multiple tasks, as detailed in Table 1. Table 1 lists datasets: COYO [7], CC3M&12M [85], COCO [54], VG Cap [39], SBU [75], VQAv2 [32], GQA [35], OK-VQA [68], AOK-VQA [84], ScienceQA [65], TextVQA [87], OCRVQA [71], TextCaps [86], RefCOCO [38], RefCOCO+ [103], RefCOCOg [67]. |
| Dataset Splits | No | The paper mentions sampling 65M image-text data for pre-training but does not explicitly state the train/validation/test splits (e.g., percentages or sample counts) for this training data. |
| Hardware Specification | No | The paper mentions '60,000 GPU hours' and 'computational workload' but does not provide specific details about the hardware used, such as GPU models, CPU types, or memory specifications for running the experiments. |
| Software Dependencies | No | The paper mentions using Vicuna as the language model and CLIP-ViT-L/14 as the visual encoder, and AdamW as the optimizer. However, it does not specify version numbers for any software libraries or dependencies (e.g., PyTorch version, Python version, specific library versions). |
| Experiment Setup | Yes | The paper provides extensive details on the experimental setup under '3.1 Experimental setup' and 'D Detailed training settings'. Table A3 lists hyperparameters such as 'Image resolution 224² | 448²', 'LLM adaptation LoRA (r=64)', 'Optimizer AdamW', 'Optimizer hyperparameters β1 = 0.9, β2 = 0.98', 'Peak learning rate 2e-4 | 3e-5', 'Training steps 120000 | 30000', 'Global batch size 512 | 256', and 'Numerical precision bfloat16' (a minimal configuration sketch follows the table). |
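
The hyperparameters quoted from Table A3 are concrete enough to sketch the two training stages in code. The snippet below is a minimal illustration assuming PyTorch: the `StageConfig` dataclass, the `build_optimizer` helper, and the dummy module are hypothetical scaffolding rather than the authors' code, while the numeric values (β1 = 0.9, β2 = 0.98, peak learning rates, step counts, batch sizes, bfloat16 precision, LoRA rank 64) are taken directly from the Table A3 entries reported above. Weight decay and the learning-rate schedule are not specified in the quoted row and are omitted.

```python
# Hypothetical sketch of the two-stage training configuration reported in Table A3.
# Only the numeric values are from the paper; the surrounding structure is illustrative.
from dataclasses import dataclass

import torch
from torch import nn


@dataclass
class StageConfig:
    image_resolution: int    # square input size per side (224 for stage 1, 448 for stage 2)
    peak_lr: float           # peak learning rate
    training_steps: int
    global_batch_size: int
    lora_rank: int = 64      # LLM adapted with LoRA (r=64)


# Stage 1 (pre-training) and stage 2 (fine-tuning) values as listed in Table A3.
STAGE1 = StageConfig(image_resolution=224, peak_lr=2e-4,
                     training_steps=120_000, global_batch_size=512)
STAGE2 = StageConfig(image_resolution=448, peak_lr=3e-5,
                     training_steps=30_000, global_batch_size=256)


def build_optimizer(model: nn.Module, cfg: StageConfig) -> torch.optim.AdamW:
    """AdamW with the betas reported in Table A3; weight decay left at the default."""
    return torch.optim.AdamW(
        model.parameters(),
        lr=cfg.peak_lr,
        betas=(0.9, 0.98),   # β1, β2 from Table A3
    )


if __name__ == "__main__":
    dummy = nn.Linear(8, 8)                 # stand-in for the trainable modules
    opt = build_optimizer(dummy, STAGE1)
    # Training is reported to use bfloat16 numerical precision, e.g. via autocast:
    with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
        loss = dummy(torch.randn(2, 8)).sum()
    loss.backward()
    opt.step()
```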