4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities

Authors: Roman Bachmann, Oguzhan Fatih Kar, David Mizrahi, Ali Garjani, Mingfei Gao, David Griffiths, Jiaming Hu, Afshin Dehghan, Amir Zamir

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate training a single model on tens of highly diverse modalities without a loss in performance compared to specialized single/few-task models. In Table 1, we evaluate the performance on DIODE [90] surface normal and depth estimation, COCO [57] semantic and instance segmentation, 3DPW [91] 3D human pose estimation, and do ImageNet-1K [79] kNN retrieval using predicted DINOv2 global tokens. (A sketch of such a kNN retrieval evaluation appears after the table.)
Researcher Affiliation | Collaboration | (1) Swiss Federal Institute of Technology Lausanne (EPFL), (2) Apple
Pseudocode | No | The paper does not contain any sections or figures explicitly labeled as 'Pseudocode' or 'Algorithm'.
Open Source Code | Yes | The multimodal models and training code are open-sourced at https://4m.epfl.ch.
Open Datasets | Yes | We co-train on several datasets to improve the model's performance and the data diversity. These include CC12M [16], which comprises about 10 million text-image samples fully pseudo-labeled with all 21 modalities, and accounts for 60% of our training samples. Additionally, we include COYO700M [12], with approximately 500 million text-image samples pseudo-labeled with the 7 modalities of 4M, and accounts for 20% of our training samples. Lastly, the Colossal Clean Crawled Corpus (C4) [71], a large text-only dataset, is used for language model co-training, also making up 20% of our training samples. (A sketch of such a weighted dataset mixture appears after the table.)
Dataset Splits | Yes | We follow the evaluation setup in [65] and evaluate on the DIODE validation set at 224×224 input resolution. We compute FID and CLIP-L/14 scores on the COCO validation set after resizing the generations to 256×256. (A sketch of this metric computation appears after the table.)
Hardware Specification | Yes | All models were trained on Nvidia A100 GPUs.
Software Dependencies | No | The paper does not list specific version numbers for software components like Python, PyTorch, or CUDA.
Experiment Setup | Yes | Table 6: Pre-training settings. Training configuration for 4M-21 used in the transfer experiments and generation results. (This table explicitly lists hyperparameters such as batch size, learning rate, weight decay, input/target token budget, and image resolution. A skeleton of such a configuration appears after the table.)
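
The ImageNet-1K kNN retrieval evaluation quoted under Research Type classifies each image by comparing its predicted DINOv2 global token against the tokens of the training set. Below is a minimal sketch of such a kNN classifier; the use of cosine similarity, majority voting, and k=20 are illustrative assumptions, not details taken from the paper. The predicted global tokens would take the place of `train_feats` and `test_feats`.

```python
import numpy as np

def knn_classify(train_feats, train_labels, test_feats, k=20):
    """Classify each test embedding by majority vote over its k nearest
    training embeddings under cosine similarity. A sketch only; the paper's
    exact retrieval protocol (k, metric, vote weighting) may differ."""
    # L2-normalize so that dot products equal cosine similarities.
    train_feats = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    test_feats = test_feats / np.linalg.norm(test_feats, axis=1, keepdims=True)

    preds = []
    for feat in test_feats:
        sims = train_feats @ feat                # similarity to every training sample
        topk = np.argsort(-sims)[:k]             # indices of the k nearest neighbors
        votes = np.bincount(train_labels[topk])  # majority vote over neighbor labels
        preds.append(votes.argmax())
    return np.array(preds)
```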
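
The quoted 60%/20%/20% co-training mix over CC12M, COYO700M, and C4 can be realized with a sampler that picks each example's source dataset according to its mixing ratio. The sketch below is a hypothetical implementation, assuming each dataset can be cycled as an iterable; it is not the authors' data-loading code.

```python
import random
from itertools import cycle

def mixed_batches(datasets, weights, batch_size, seed=0):
    """Yield batches drawn from `datasets` with per-sample source
    probabilities `weights`. A sketch; the actual 4M-21 sampler may differ."""
    rng = random.Random(seed)
    iters = [cycle(ds) for ds in datasets]
    while True:
        batch = []
        for _ in range(batch_size):
            # Choose a source dataset according to the mixing ratios,
            # then take its next sample.
            i = rng.choices(range(len(iters)), weights=weights)[0]
            batch.append(next(iters[i]))
        yield batch

# Matching the ratios quoted above (dataset names are hypothetical):
# batches = mixed_batches([cc12m, coyo700m, c4], weights=[0.6, 0.2, 0.2], batch_size=512)
```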
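
The FID and CLIP-L/14 evaluation quoted under Dataset Splits can be reproduced with off-the-shelf metric implementations. The sketch below uses torchmetrics as one possible choice (the paper does not name a library) and resizes generations to 256×256 per the quoted setup; the function and variable names are hypothetical.

```python
import torch
import torch.nn.functional as F
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.multimodal.clip_score import CLIPScore

# torchmetrics is an assumption here; the paper does not state which
# implementation was used to compute FID and CLIP scores.
fid = FrechetInceptionDistance(feature=2048)
clip_score = CLIPScore(model_name_or_path="openai/clip-vit-large-patch14")

def update_metrics(real_images, generated_images, captions):
    """real_images / generated_images: uint8 tensors of shape (B, 3, H, W)."""
    # Resize generations to 256x256, matching the quoted evaluation setup.
    gen = F.interpolate(generated_images.float(), size=(256, 256), mode="bilinear")
    gen = gen.clamp(0, 255).to(torch.uint8)

    fid.update(real_images, real=True)
    fid.update(gen, real=False)
    clip_score.update(gen, captions)

# After iterating over the COCO validation set:
# print(fid.compute(), clip_score.compute())
```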
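
The hyperparameters that Table 6 reports can be collected into a single configuration object. The skeleton below only mirrors the field names quoted above; every value is a placeholder, and the actual settings are in Table 6 of the paper.

```python
from dataclasses import dataclass

@dataclass
class PretrainConfig:
    """Fields mirroring the 4M-21 pre-training hyperparameters listed in
    Table 6. All values below are placeholders, not the paper's settings."""
    batch_size: int = 0           # placeholder; see Table 6
    learning_rate: float = 0.0    # placeholder; see Table 6
    weight_decay: float = 0.0     # placeholder; see Table 6
    input_token_budget: int = 0   # placeholder; see Table 6
    target_token_budget: int = 0  # placeholder; see Table 6
    image_resolution: int = 0     # placeholder; see Table 6
```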