What matters when building vision-language models?

Authors: Hugo Laurençon, Léo Tronchon, Matthieu Cord, Victor Sanh

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive experiments around pre-trained models, architecture choice, data, and training methods. Our consolidation of findings includes the development of Idefics2, an efficient foundational VLM of 8 billion parameters. Idefics2 achieves state-of-the-art performance within its size category across various multimodal benchmarks, and is often on par with models four times its size.
Researcher Affiliation | Collaboration | Hugo Laurençon (1,2), Léo Tronchon (1), Matthieu Cord (2,3), Victor Sanh (1); 1 Hugging Face, 2 Sorbonne Université, 3 valeo.ai, Paris, France
Pseudocode | No | No pseudocode or algorithm blocks are present in the paper.
Open Source Code | Yes | We release the model (base, instructed, and chat) along with the datasets created for its training. Our model is integrated into the Transformers library (Wolf et al., 2020), so the code to use the model is available online. (See the model-loading sketch after the table.)
Open Datasets | Yes | We train our model on open datasets only. We also release the dataset we built during this work as an open-source resource. (See the dataset-loading sketch after the table.)
Dataset Splits | Yes | We specify the splits of the evaluations in Table 8 and Table 9 (e.g., TextVQA, VQA accuracy, validation split).
Hardware Specification | Yes | We run the ablations on eight nodes containing eight H100s each, for up to five days. In total, we use 32 nodes of eight H100s each for three weeks for the multi-stage pre-training.
Software Dependencies | No | The paper mentions software such as the Transformers library (Wolf et al., 2020) and the AdamW optimizer, but does not provide specific version numbers for key software components (e.g., Python, PyTorch, or library releases).
Experiment Setup | Yes | We use a learning rate of 10^-4 with AdamW as the optimizer and run around 2 epochs on our training data. In the first stage, we limit the max image resolution to 384 pixels, which allows us to use a large global batch size of 2,048 (17k images and 2.5M text tokens on average). (See the training-configuration sketch after the table.)
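
As noted in the Open Source Code row, the model is integrated into the Transformers library. Below is a minimal loading sketch, assuming the released checkpoint is published on the Hugging Face Hub under the ID HuggingFaceM4/idefics2-8b; that ID is not stated in this summary and is an assumption.

    from transformers import AutoProcessor, AutoModelForVision2Seq

    # Assumed Hub checkpoint ID; the summary only states that the model
    # (base, instructed, and chat) is released and integrated into Transformers.
    checkpoint = "HuggingFaceM4/idefics2-8b"

    processor = AutoProcessor.from_pretrained(checkpoint)
    model = AutoModelForVision2Seq.from_pretrained(checkpoint)
    print(model.config.model_type)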
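The Open Datasets row states that the training dataset built for this work is released openly. A minimal sketch with the Hugging Face datasets library follows, assuming the release is hosted on the Hub under the ID HuggingFaceM4/the_cauldron with per-source subsets such as "ai2d"; both the repository ID and the subset name are assumptions, not stated in this summary.

    from datasets import load_dataset

    # Assumed Hub repository and subset name; adjust to the actual released dataset.
    ds = load_dataset("HuggingFaceM4/the_cauldron", "ai2d", split="train")
    print(len(ds), ds.column_names)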
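The Experiment Setup row reports AdamW with a learning rate of 10^-4, roughly 2 epochs, a 384-pixel maximum image resolution, and a global batch size of 2,048 in the first stage. The sketch below shows that optimizer configuration in PyTorch with a placeholder module, since the actual Idefics2 training code is not reproduced here; betas and weight decay are left at PyTorch defaults, which the summary does not specify.

    import torch

    # Placeholder stand-in for the real VLM; only the optimizer settings mirror the report.
    model = torch.nn.Linear(384, 384)

    # Reported optimizer choice and learning rate (AdamW, lr = 1e-4).
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    # Reported stage-1 settings (values from the table above).
    GLOBAL_BATCH_SIZE = 2048      # samples per optimizer step across all devices
    MAX_IMAGE_RESOLUTION = 384    # maximum image resolution in pixels during stage 1
    NUM_EPOCHS = 2                # roughly two passes over the training mixture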