Libra: Building Decoupled Vision System on Large Language Models
Authors: Yifan Xu, Xiaoshan Yang, Yaguang Song, Changsheng Xu
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "Experimental results demonstrate that the dedicated design of Libra achieves a strong MLLM baseline that rivals existing works in the image-to-text scenario with merely 50 million training data, providing a new perspective for future multimodal foundation models." See also Section 4 (Experiments). |
| Researcher Affiliation | Academia | 1MAIS, Institute of Automation, Chinese Academy of Sciences 2Peng Cheng Laboratory 3School of Artificial Intelligence, University of the Chinese Academy of Sciences. Correspondence to: Changsheng Xu <csxu@nlpr.ia.ac.cn>. |
| Pseudocode | Yes | Figure 7. Pseudo code for the computing process of attention differences in Fig. 4. |
| Open Source Code | Yes | Code is available at https://github.com/YifanXu74/Libra. |
| Open Datasets | Yes | For pretraining, we use 50M image-text pairs randomly sampled from COYO-700M (Byeon et al., 2022) and CC12M (Changpinyo et al., 2021). We use additional 500K image-text pairs from COCO (Chen et al., 2015) training split to standardize the caption outputs. |
| Dataset Splits | No | The paper mentions using 'test-dev' or 'val' splits for external evaluation benchmarks (e.g., VQAv2, OKVQA, NoCaps), but it does not explicitly provide details about a dedicated validation set or split used during the model's own training process. |
| Hardware Specification | Yes | The multimodal pretraining stage takes 8400 NVIDIA A100-40G GPU hours and the instruction tuning stage takes 380 NVIDIA A100-40G GPU hours. |
| Software Dependencies | No | The paper mentions software components like 'LLaMA2-7B-Chat', 'CLIP-ViT-L-336px', 'VQGAN', and 'SentencePiece tokenizer', but does not provide specific version numbers for these or any other underlying software libraries or dependencies. |
| Experiment Setup | Yes | Table 4 (training hyperparameters of Libra, Pretraining / SFT): total steps 40000 / 7000; warmup steps 2000 / 300; batch size 1280 / 128; learning rate 1e-4 / 2e-5; cosine learning-rate decay; weight decay 0.01; dropout ratio 0.0; AdamW optimizer (ε = 1e-8, β = (0.9, 0.99)); gradient clipping 1.0; bfloat16 numerical precision. See the sketch below the table. |
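
The Experiment Setup row reports the Table 4 hyperparameters only as text. Below is a minimal PyTorch sketch of an optimizer and learning-rate schedule matching those reported values for the pretraining stage; it is not the authors' released code (that lives in the linked repository), and the function name `build_optimizer_and_scheduler` and its arguments are illustrative assumptions.

```python
# Minimal sketch (assumed, not the authors' implementation) of an AdamW optimizer
# with linear warmup + cosine decay, using the Table 4 pretraining values.
import math
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR


def build_optimizer_and_scheduler(model: torch.nn.Module,
                                  lr: float = 1e-4,          # 2e-5 for the SFT stage
                                  weight_decay: float = 0.01,
                                  total_steps: int = 40000,  # 7000 for the SFT stage
                                  warmup_steps: int = 2000): # 300 for the SFT stage
    optimizer = AdamW(model.parameters(), lr=lr, betas=(0.9, 0.99),
                      eps=1e-8, weight_decay=weight_decay)

    def lr_lambda(step: int) -> float:
        # Linear warmup followed by cosine decay to zero, as reported in Table 4.
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))

    scheduler = LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```

Training in the paper is additionally run in bfloat16 with gradient clipping at 1.0; in a setup like the one above this would correspond to wrapping the forward pass in `torch.autocast(device_type="cuda", dtype=torch.bfloat16)` and calling `torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)` before each optimizer step.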