Accelerating Pre-training of Multimodal LLMs via Chain-of-Sight
Authors: Ziyuan Huang, Kaixiang Ji, Biao Gong, Zhiwu Qing, Qinglong Zhang, Kecheng Zheng, Jian Wang, Jingdong Chen, Ming Yang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical results on a series of vision-language benchmarks reveal that the pre-training acceleration through Chain-of-Sight is achieved without sacrificing performance, matching or surpassing the standard pipeline of utilizing all visual tokens throughout the entire training process. |
| Researcher Affiliation | Collaboration | Ziyuan Huang¹, Kaixiang Ji¹, Biao Gong¹, Zhiwu Qing², Qinglong Zhang¹, Kecheng Zheng¹, Jian Wang¹, Jingdong Chen¹, Ming Yang¹ (¹Ant Group, ²Huazhong University of Science and Technology) |
| Pseudocode | No | The paper includes figures and equations to describe the framework but does not contain explicit pseudocode or algorithm blocks. |
| Open Source Code | No | The paper provides a general project website URL (https://chain-of-sight.github.io/) but does not include a direct link to a specific source-code repository or an explicit statement confirming the release of the code for the described methodology. |
| Open Datasets | Yes | For the first stage, we sample around 65M image-text data involving multiple tasks, as detailed in Table 1. Table 1 lists datasets: COYO [7], CC3M&12M [85], COCO [54], VG Cap [39], SBU [75], VQAv2 [32], GQA [35], OK-VQA [68], AOK-VQA [84], ScienceQA [65], TextVQA [87], OCRVQA [71], TextCaps [86], RefCOCO [38], RefCOCO+ [103], RefCOCOg [67]. |
| Dataset Splits | No | The paper mentions sampling 65M image-text data for pre-training but does not explicitly state the train/validation/test splits (e.g., percentages or sample counts) for this training data. |
| Hardware Specification | No | The paper mentions '60,000 GPU hours' and 'computational workload' but does not provide specific details about the hardware used, such as GPU models, CPU types, or memory specifications for running the experiments. |
| Software Dependencies | No | The paper mentions using Vicuna as the language model and CLIP-ViT-L/14 as the visual encoder, and AdamW as the optimizer. However, it does not specify version numbers for any software libraries or dependencies (e.g., PyTorch version, Python version, specific library versions). |
| Experiment Setup | Yes | The paper provides extensive details on the experimental setup under '3.1 Experimental setup' and 'D Detailed training settings'. Table A3 lists hyperparameters such as 'Image resolution 224² | 448²', 'LLM adaptation LoRA (r=64)', 'Optimizer AdamW', 'Optimizer hyperparameters β1 = 0.9, β2 = 0.98', 'Peak learning rate 2e-4 | 3e-5', 'Training steps 120000 | 30000', 'Global batch size 512 | 256', and 'Numerical precision bfloat16' (a minimal configuration sketch follows the table). |
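
The hyperparameters quoted from Table A3 are concrete enough to sketch the two training stages in code. The snippet below is a minimal illustration assuming PyTorch: the `StageConfig` dataclass, the `build_optimizer` helper, and the dummy module are hypothetical scaffolding rather than the authors' code, while the numeric values (β1 = 0.9, β2 = 0.98, peak learning rates, step counts, batch sizes, bfloat16 precision, LoRA rank 64) are taken directly from the Table A3 entries reported above. Weight decay and the learning-rate schedule are not specified in the quoted row and are omitted.

```python
# Hypothetical sketch of the two-stage training configuration reported in Table A3.
# Only the numeric values are from the paper; the surrounding structure is illustrative.
from dataclasses import dataclass

import torch
from torch import nn


@dataclass
class StageConfig:
    image_resolution: int    # square input size per side (224 for stage 1, 448 for stage 2)
    peak_lr: float           # peak learning rate
    training_steps: int
    global_batch_size: int
    lora_rank: int = 64      # LLM adapted with LoRA (r=64)


# Stage 1 (pre-training) and stage 2 (fine-tuning) values as listed in Table A3.
STAGE1 = StageConfig(image_resolution=224, peak_lr=2e-4,
                     training_steps=120_000, global_batch_size=512)
STAGE2 = StageConfig(image_resolution=448, peak_lr=3e-5,
                     training_steps=30_000, global_batch_size=256)


def build_optimizer(model: nn.Module, cfg: StageConfig) -> torch.optim.AdamW:
    """AdamW with the betas reported in Table A3; weight decay left at the default."""
    return torch.optim.AdamW(
        model.parameters(),
        lr=cfg.peak_lr,
        betas=(0.9, 0.98),   # β1, β2 from Table A3
    )


if __name__ == "__main__":
    dummy = nn.Linear(8, 8)                 # stand-in for the trainable modules
    opt = build_optimizer(dummy, STAGE1)
    # Training is reported to use bfloat16 numerical precision, e.g. via autocast:
    with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
        loss = dummy(torch.randn(2, 8)).sum()
    loss.backward()
    opt.step()
```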