What matters when building vision-language models?

Authors: Hugo Laurençon, Léo Tronchon, Matthieu Cord, Victor Sanh

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive experiments around pre-trained models, architecture choice, data, and training methods. Our consolidation of findings includes the development of Idefics2, an efficient foundational VLM of 8 billion parameters. Idefics2 achieves state-of-the-art performance within its size category across various multimodal benchmarks, and is often on par with models four times its size.
Researcher Affiliation | Collaboration | Hugo Laurençon (1,2), Léo Tronchon (1), Matthieu Cord (2,3), Victor Sanh (1); 1 Hugging Face, 2 Sorbonne Université, 3 valeo.ai, Paris, France
Pseudocode | No | No pseudocode or algorithm blocks are present in the paper.
Open Source Code | Yes | We release the model (base, instructed, and chat) along with the datasets created for its training. Our model is integrated into the Transformers library (Wolf et al., 2020), so the code to use the model is available online. (See the model-loading sketch after the table.)
Open Datasets | Yes | We train our model on open datasets only. We also release the dataset we built during this work as an open-source resource. (See the dataset-loading sketch after the table.)
Dataset Splits | Yes | We specify the splits of the evaluations in Table 8 and Table 9 (e.g., TextVQA, VQA accuracy, validation split).
Hardware Specification | Yes | We run the ablations on eight nodes containing eight H100s each, for up to five days. In total, we use 32 nodes of eight H100s each for three weeks for the multi-stage pre-training.
Software Dependencies | No | The paper mentions software such as the Transformers library (Wolf et al., 2020) and the AdamW optimizer, but does not provide specific version numbers for key software components (e.g., Python, PyTorch, or library releases).
Experiment Setup | Yes | We use a learning rate of 10^-4 with AdamW as the optimizer and run around 2 epochs on our training data. In the first stage, we limit the max image resolution to 384 pixels, which allows us to use a large global batch size of 2,048 (17k images and 2.5M text tokens on average). (See the training-configuration sketch after the table.)
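
As noted in the Open Source Code row, the model is integrated into the Transformers library. Below is a minimal loading sketch, assuming the released checkpoint is published on the Hugging Face Hub under the ID HuggingFaceM4/idefics2-8b; that ID is not stated in this summary and is an assumption.

    from transformers import AutoProcessor, AutoModelForVision2Seq

    # Assumed Hub checkpoint ID; the summary only states that the model
    # (base, instructed, and chat) is released and integrated into Transformers.
    checkpoint = "HuggingFaceM4/idefics2-8b"

    processor = AutoProcessor.from_pretrained(checkpoint)
    model = AutoModelForVision2Seq.from_pretrained(checkpoint)
    print(model.config.model_type)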
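The Open Datasets row states that the training dataset built for this work is released openly. A minimal sketch with the Hugging Face datasets library follows, assuming the release is hosted on the Hub under the ID HuggingFaceM4/the_cauldron with per-source subsets such as "ai2d"; both the repository ID and the subset name are assumptions, not stated in this summary.

    from datasets import load_dataset

    # Assumed Hub repository and subset name; adjust to the actual released dataset.
    ds = load_dataset("HuggingFaceM4/the_cauldron", "ai2d", split="train")
    print(len(ds), ds.column_names)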
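The Experiment Setup row reports AdamW with a learning rate of 10^-4, roughly 2 epochs, a 384-pixel maximum image resolution, and a global batch size of 2,048 in the first stage. The sketch below shows that optimizer configuration in PyTorch with a placeholder module, since the actual Idefics2 training code is not reproduced here; betas and weight decay are left at PyTorch defaults, which the summary does not specify.

    import torch

    # Placeholder stand-in for the real VLM; only the optimizer settings mirror the report.
    model = torch.nn.Linear(384, 384)

    # Reported optimizer choice and learning rate (AdamW, lr = 1e-4).
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    # Reported stage-1 settings (values from the table above).
    GLOBAL_BATCH_SIZE = 2048      # samples per optimizer step across all devices
    MAX_IMAGE_RESOLUTION = 384    # maximum image resolution in pixels during stage 1
    NUM_EPOCHS = 2                # roughly two passes over the training mixture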