What matters when building vision-language models?
Authors: Hugo Laurençon, Léo Tronchon, Matthieu Cord, Victor Sanh
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments around pre-trained models, architecture choice, data, and training methods. Our consolidation of findings includes the development of Idefics2, an efficient foundational VLM of 8 billion parameters. Idefics2 achieves state-of-the-art performance within its size category across various multimodal benchmarks, and is often on par with models four times its size. |
| Researcher Affiliation | Collaboration | Hugo Laurençon (1,2), Léo Tronchon (1), Matthieu Cord (2,3), Victor Sanh (1); 1 Hugging Face, 2 Sorbonne Université, 3 valeo.ai, Paris, France |
| Pseudocode | No | No pseudocode or algorithm blocks are present in the paper. |
| Open Source Code | Yes | We release the model (base, instructed, and chat) along with the datasets created for its training. Our model is integrated into the Transformers library (Wolf et al., 2020), so the code to use the model is available online (see the usage sketch after the table). |
| Open Datasets | Yes | We train our model on open datasets only. We also release the dataset we built during this work as an open-source resource. |
| Dataset Splits | Yes | We specify the splits of the evaluations in Table 8 and Table 9 (e.g., TextVQA, VQA acc., val split). |
| Hardware Specification | Yes | We run the ablations on eight nodes containing eight H100s each, for up to five days. In total, we use 32 nodes of eight H100s each for 3 weeks for the multi-stage pre-training. |
| Software Dependencies | No | The paper mentions software like the Transformers library (Wolf et al., 2020) and optimizers like AdamW, but does not provide specific version numbers for all key software components (e.g., Python, PyTorch, or specific library versions). |
| Experiment Setup | Yes | We use a learning rate of 10^-4 with AdamW for the optimizer, and do around 2 epochs on our training data. In the first stage, we limit the max image resolution to 384 pixels, which allows us to use a large global batch size of 2,048 (17k images and 2.5M text tokens on average). |
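Since the paper states that the model is integrated into the Transformers library, a minimal usage sketch follows. The checkpoint identifier `HuggingFaceM4/idefics2-8b`, the image URL, and the prompt are assumptions added for illustration, not details quoted from the paper.

```python
# Minimal sketch: loading and prompting Idefics2 through the Transformers library.
# The checkpoint id and image URL below are assumed placeholders for illustration.
import requests
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

checkpoint = "HuggingFaceM4/idefics2-8b"  # assumed model id
processor = AutoProcessor.from_pretrained(checkpoint)
model = AutoModelForVision2Seq.from_pretrained(checkpoint)

# Placeholder image and question.
image = Image.open(requests.get("https://example.com/cat.jpg", stream=True).raw)
messages = [
    {"role": "user",
     "content": [{"type": "image"},
                 {"type": "text", "text": "What is in this picture?"}]},
]

# Build the chat-formatted prompt, run generation, and decode the answer.
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
generated_ids = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```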
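The optimizer and batch-size settings quoted in the Experiment Setup row can be expressed as a short PyTorch sketch. The stand-in model, the number of GPUs, and the gradient-accumulation split are assumptions for illustration; only the learning rate, global batch size, and image resolution come from the quoted text.

```python
# Sketch of the quoted stage-1 settings: AdamW at lr 1e-4, global batch size 2,048,
# max image resolution 384 px. Everything else below is an assumed placeholder.
import torch
from torch.optim import AdamW

model = torch.nn.Linear(16, 16)  # stand-in for the VLM parameters
optimizer = AdamW(model.parameters(), lr=1e-4)

GLOBAL_BATCH_SIZE = 2048   # stage-1 global batch size from the paper
MAX_IMAGE_RES = 384        # stage-1 maximum image resolution (pixels)

# Hypothetical split of the global batch across 64 GPUs (8 nodes x 8 H100s,
# the ablation setup quoted above); accumulation steps are illustrative.
NUM_GPUS = 8 * 8
GRAD_ACCUM_STEPS = 4
per_device_batch = GLOBAL_BATCH_SIZE // (NUM_GPUS * GRAD_ACCUM_STEPS)  # -> 8
```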