Libra: Building Decoupled Vision System on Large Language Models

Authors: Yifan Xu, Xiaoshan Yang, Yaguang Song, Changsheng Xu

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experimental results demonstrate that the dedicated design of Libra achieves a strong MLLM baseline that rivals existing works in the image-to-text scenario with merely 50 million training data, providing a new perspective for future multimodal foundation models." and Section 4, "Experiments".
Researcher Affiliation | Academia | ¹MAIS, Institute of Automation, Chinese Academy of Sciences; ²Peng Cheng Laboratory; ³School of Artificial Intelligence, University of the Chinese Academy of Sciences. Correspondence to: Changsheng Xu <csxu@nlpr.ia.ac.cn>.
Pseudocode | Yes | "Figure 7. Pseudo code for the computing process of attention differences in Fig. 4."
Open Source Code | Yes | "Code is available at https://github.com/YifanXu74/Libra."
Open Datasets | Yes | "For pretraining, we use 50M image-text pairs randomly sampled from COYO-700M (Byeon et al., 2022) and CC12M (Changpinyo et al., 2021). We use additional 500K image-text pairs from COCO (Chen et al., 2015) training split to standardize the caption outputs."
Dataset Splits | No | The paper mentions using 'test-dev' or 'val' splits for external evaluation benchmarks (e.g., VQAv2, OKVQA, NoCaps), but it does not explicitly describe a dedicated validation split used during the model's own training.
Hardware Specification | Yes | "The multimodal pretraining stage takes 8400 NVIDIA A100-40G GPU hours and the instruction tuning stage takes 380 NVIDIA A100-40G GPU hours." (A rough wall-clock estimate appears after the table.)
Software Dependencies | No | The paper mentions software components such as 'LLaMA2-7B-Chat', 'CLIP-ViT-L-336px', 'VQGAN', and the 'SentencePiece tokenizer', but does not provide specific version numbers for these or for any other underlying software libraries or dependencies.
Experiment Setup | Yes | "Table 4. Training hyperparameters of Libra in different stages." Pretraining / SFT: total steps 40000 / 7000; warmup steps 2000 / 300; batch size 1280 / 128; learning rate 1e-4 / 2e-5; learning rate decay: cosine; weight decay 0.01; dropout ratio 0.0; optimizer: AdamW (ε = 1e-8, β = (0.9, 0.99)); gradient clipping 1.0; numerical precision: bfloat16. (See the configuration sketch below.)
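To make the Table 4 hyperparameters easier to reuse, here is a minimal configuration sketch. The dictionary keys and the `get_config` helper are illustrative names and not taken from the Libra codebase; only the numeric values reflect what Table 4 reports for the pretraining and SFT stages.

```python
# Hypothetical configuration dictionaries mirroring Table 4 of the paper.
# Key names and get_config() are illustrative; the values are those reported
# for the pretraining and supervised fine-tuning (SFT) stages.

COMMON = {
    "lr_schedule": "cosine",     # learning rate decay
    "weight_decay": 0.01,
    "dropout": 0.0,
    "optimizer": "AdamW",
    "adam_eps": 1e-8,
    "adam_betas": (0.9, 0.99),
    "grad_clip": 1.0,
    "precision": "bfloat16",
}

PRETRAIN = {**COMMON, "total_steps": 40_000, "warmup_steps": 2_000,
            "batch_size": 1280, "lr": 1e-4}

SFT = {**COMMON, "total_steps": 7_000, "warmup_steps": 300,
       "batch_size": 128, "lr": 2e-5}

def get_config(stage: str) -> dict:
    """Return the hyperparameter set for a given training stage."""
    return {"pretrain": PRETRAIN, "sft": SFT}[stage]

if __name__ == "__main__":
    print(get_config("sft")["lr"])  # 2e-05
```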
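The hardware budget is reported in GPU hours rather than wall-clock time. The paper does not state how many GPUs were used; the 64-GPU figure below is purely an assumption for a back-of-the-envelope conversion.

```python
# Hedged conversion of the reported GPU hours to wall-clock time.
# The GPU count is an assumption; the paper only reports total GPU hours.
def wall_clock_days(gpu_hours: float, num_gpus: int) -> float:
    """Wall-clock days, assuming perfect scaling across num_gpus."""
    return gpu_hours / num_gpus / 24

print(wall_clock_days(8400, 64))  # pretraining: ~5.5 days on 64 A100-40G (assumed)
print(wall_clock_days(380, 64))   # instruction tuning: ~0.25 days (~6 hours)
```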