Jointly Training Large Autoregressive Multimodal Models

Authors: Emanuele Aiello, Lili Yu, Yixin Nie, Armen Aghajanyan, Barlas Oguz

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To achieve this objective, we conduct a comprehensive empirical investigation into the fusion of two specialized autoregressive, decoder-only, large transformer models, each designed for a distinct task (one text-to-image model and one text-only model). (A minimal fusion sketch appears after this table.)
Researcher Affiliation | Collaboration | Politecnico di Torino, Meta AI
Pseudocode | No | The paper describes its methods in prose and includes architectural diagrams, but does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain any explicit statement about releasing the source code for the methodology described, nor does it provide a link to a code repository.
Open Datasets | Yes | Text corpora: We use 30B text tokens sampled from a mixture of several publicly available datasets, reusing the data used for training other common open-source LLMs and following the same preprocessing as (Touvron et al., 2023). The datasets are: English Common Crawl (Touvron et al., 2023), C4 (Raffel et al., 2020), Wikipedia, Books3 from The Pile (Gao et al., 2020), and arXiv. (A corpus-mixture sketch appears after this table.)
Dataset Splits | No | The paper mentions using 'validation perplexity (PPL)' for model selection and discusses training token counts and epochs, but it does not specify the explicit training, validation, and test dataset splits (e.g., percentages or absolute sample counts) needed to reproduce the data partitioning. While MS-COCO has standard splits, the paper does not explicitly state which split was used for validation or how the custom datasets were split.
Hardware Specification | Yes | This training procedure takes approximately one day on 256 80GB A100s for all models.
Software Dependencies | No | The paper mentions specific models and tokenizers (e.g., 'VQ-VAE tokenizer', 'CM3leon') and objectives, but does not provide specific software dependency names with version numbers (e.g., 'Python 3.8', 'TensorFlow 2.x') required for reproducibility.
Experiment Setup | Yes | Our initial learning rate is lr = 3 × 10^-5 and we use 500 warm-up steps. We set our optimal batch size to 8M tokens. The total number of training steps is 5960. (An optimizer/scheduler sketch using these values appears after this table.)
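
The Research Type row above describes fusing two decoder-only transformers, one text-only and one text-to-image. The snippet below is a minimal, illustrative sketch of one simple fusion strategy, elementwise weight averaging of two architecturally identical checkpoints; the checkpoint paths, the interpolation weight alpha, and the use of plain averaging are assumptions for illustration, not the paper's exact recipe.

```python
# Minimal sketch: fuse two architecturally identical decoder-only transformers
# by averaging their parameters. Illustrative only; the checkpoint paths, the
# interpolation weight `alpha`, and plain weight averaging are assumptions.
import torch

def average_state_dicts(text_ckpt: str, image_ckpt: str, alpha: float = 0.5) -> dict:
    """Interpolate parameters of two checkpoints with matching keys and shapes."""
    text_sd = torch.load(text_ckpt, map_location="cpu")
    image_sd = torch.load(image_ckpt, map_location="cpu")
    assert text_sd.keys() == image_sd.keys(), "checkpoints must share an architecture"
    fused = {}
    for name, text_param in text_sd.items():
        # Weighted average of the text-only and text-to-image parameters.
        fused[name] = alpha * text_param + (1.0 - alpha) * image_sd[name]
    return fused

if __name__ == "__main__":
    # Hypothetical checkpoint files; replace with real paths.
    fused_sd = average_state_dicts("text_model.pt", "image_model.pt", alpha=0.5)
    torch.save(fused_sd, "fused_model.pt")
```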
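The Open Datasets row lists a 30B-token mixture drawn from several public text corpora. The sketch below shows one way to sample documents from a weighted mixture until a token budget is reached; the corpus names mirror the row above, but the mixture weights and the document loader are hypothetical, since the paper does not state its exact sampling proportions.

```python
# Minimal sketch: sample pre-tokenized documents from a weighted mixture of
# corpora until a target token budget is reached. The weights and the dummy
# document loader are hypothetical placeholders.
import random
from typing import Iterator, List

CORPORA = {                      # hypothetical mixture weights
    "common_crawl": 0.60,
    "c4": 0.20,
    "wikipedia": 0.08,
    "books3": 0.07,
    "arxiv": 0.05,
}
TARGET_TOKENS = 30_000_000_000   # 30B text tokens, as stated in the paper

def iter_documents(corpus_name: str) -> Iterator[List[int]]:
    """Placeholder loader: yields dummy token-id lists for one corpus."""
    rng = random.Random(hash(corpus_name) & 0xFFFF)
    while True:
        yield [rng.randrange(32_000) for _ in range(rng.randint(128, 2048))]

def sample_mixture(seed: int = 0) -> Iterator[List[int]]:
    rng = random.Random(seed)
    names = list(CORPORA)
    weights = [CORPORA[n] for n in names]
    streams = {n: iter_documents(n) for n in names}
    produced = 0
    while produced < TARGET_TOKENS:
        corpus = rng.choices(names, weights=weights, k=1)[0]
        doc = next(streams[corpus])
        produced += len(doc)
        yield doc
```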
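The Experiment Setup row reports an initial learning rate of 3e-5, 500 warm-up steps, an 8M-token batch size, and 5960 total steps. The sketch below wires those reported values into an optimizer and learning-rate schedule; the choice of AdamW and of a linear decay after warm-up are assumptions, as the row does not specify the optimizer or the decay shape, and the stand-in model and dummy loss are for illustration only.

```python
# Minimal sketch: optimizer and LR schedule using the reported values
# (initial lr 3e-5, 500 warm-up steps, 5960 total steps). AdamW and the
# linear decay after warm-up are assumptions, not confirmed by the row above.
import torch
from torch.optim.lr_scheduler import LambdaLR

PEAK_LR = 3e-5
WARMUP_STEPS = 500
TOTAL_STEPS = 5960

def lr_lambda(step: int) -> float:
    if step < WARMUP_STEPS:
        return step / max(1, WARMUP_STEPS)                          # linear warm-up
    return max(0.0, (TOTAL_STEPS - step) / (TOTAL_STEPS - WARMUP_STEPS))  # linear decay

model = torch.nn.Linear(8, 8)                  # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=PEAK_LR)
scheduler = LambdaLR(optimizer, lr_lambda)

for step in range(TOTAL_STEPS):
    optimizer.zero_grad()
    loss = model(torch.randn(4, 8)).pow(2).mean()   # dummy loss for illustration
    loss.backward()
    optimizer.step()
    scheduler.step()
```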