4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities
Authors: Roman Bachmann, Oguzhan Fatih Kar, David Mizrahi, Ali Garjani, Mingfei Gao, David Griffiths, Jiaming Hu, Afshin Dehghan, Amir Zamir
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate training a single model on tens of highly diverse modalities without a loss in performance compared to specialized single/few task models. In Table 1, we evaluate the performance on DIODE [90] surface normal and depth estimation, COCO [57] semantic and instance segmentation, 3DPW [91] 3D human pose estimation, and do ImageNet-1K [79] kNN retrieval using predicted DINOv2 global tokens. |
| Researcher Affiliation | Collaboration | ¹Swiss Federal Institute of Technology Lausanne (EPFL), ²Apple |
| Pseudocode | No | The paper does not contain any sections or figures explicitly labeled as 'Pseudocode' or 'Algorithm'. |
| Open Source Code | Yes | The multimodal models and training code are open-sourced at https://4m.epfl.ch. (See the model-loading sketch after the table.) |
| Open Datasets | Yes | We co-train on several datasets to improve the model's performance and the data diversity. These include CC12M [16], which comprises about 10 million text-image samples fully pseudo-labeled with all 21 modalities, and accounts for 60% of our training samples. Additionally, we include COYO700M [12], with approximately 500 million text-image samples pseudo-labeled with the 7 modalities of 4M, accounting for 20% of our training samples. Lastly, the Colossal Clean Crawled Corpus (C4) [71], a large text-only dataset, is used for language model co-training, also making up 20% of our training samples. (See the mixture-sampling sketch after the table.) |
| Dataset Splits | Yes | We follow the evaluation setup in [65] and evaluate on the DIODE validation set at 224×224 input resolution. We compute FID and CLIP-L/14 scores on the COCO validation set after resizing the generations to 256×256. (See the FID sketch after the table.) |
| Hardware Specification | Yes | All models were trained on Nvidia A100 GPUs. |
| Software Dependencies | No | The paper does not list specific version numbers for software components like Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | Table 6: Pre-training settings. Training configuration for 4M-21 used in the transfer experiments and generation results. (The table explicitly lists hyperparameters such as batch size, learning rate, weight decay, input/target token budget, and image resolution; see the configuration skeleton after the table.) |
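
For reproducers who want to start from the released checkpoints, the snippet below sketches one way a 4M-21 model might be loaded. This is a minimal sketch, assuming the open-source release exposes a `fourm` package with an `FM` model class and publishes Hugging Face checkpoints under an identifier like `EPFL-VILAB/4M-21_L`; neither the class name nor the checkpoint id is confirmed by the paper, so verify both against the repository linked from https://4m.epfl.ch.

```python
# Minimal loading sketch. ASSUMPTIONS: the release provides a `fourm`
# package with an `FM` class supporting Hugging Face hub loading, and a
# checkpoint named 'EPFL-VILAB/4M-21_L'. Check the actual repository
# linked from https://4m.epfl.ch for the real API and model names.
from fourm.models.fm import FM

model = FM.from_pretrained("EPFL-VILAB/4M-21_L")  # hypothetical checkpoint id
model.eval()

# A 4M-style model consumes and produces sequences of modality tokens;
# the per-modality tokenizers ship separately in the release.
print(sum(p.numel() for p in model.parameters()), "parameters")
```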
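The 60/20/20 sampling mixture over CC12M, COYO700M, and C4 quoted in the Open Datasets row can be made concrete with a small proportional sampler. This is an illustrative sketch, not the authors' actual data loader; the dataset names and weights follow the quote, everything else is assumed.

```python
import random

# Hypothetical sketch of the 60/20/20 pre-training data mixture described
# in the paper. Weights follow the quoted proportions; the sampling helper
# itself is illustrative, not the authors' data-loading code.
MIXTURE = {
    "CC12M":    0.60,  # ~10M text-image pairs, pseudo-labeled with all 21 modalities
    "COYO700M": 0.20,  # ~500M text-image pairs, pseudo-labeled with the 7 modalities of 4M
    "C4":       0.20,  # text-only corpus used for language-model co-training
}

def sample_source(rng: random.Random) -> str:
    """Pick the source dataset for the next training sample by mixture weight."""
    names, weights = zip(*MIXTURE.items())
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(0)
counts = {name: 0 for name in MIXTURE}
for _ in range(10_000):
    counts[sample_source(rng)] += 1
print(counts)  # roughly {'CC12M': 6000, 'COYO700M': 2000, 'C4': 2000}
```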
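The Dataset Splits row mentions FID and CLIP-L/14 scores on the COCO validation set after resizing generations to 256×256. The sketch below shows one way to compute FID with `torchmetrics`; the library choice is ours for illustration, and the paper does not state which FID implementation the authors used.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

# Illustrative FID computation; torchmetrics is our library choice, not
# stated in the paper. Images are expected as uint8 tensors in [0, 255],
# resized to 256x256 as the evaluation protocol describes.
fid = FrechetInceptionDistance(feature=2048)

def to_uint8_256(images: torch.Tensor) -> torch.Tensor:
    """Resize a float batch (B, 3, H, W) in [0, 1] to 256x256 uint8."""
    images = torch.nn.functional.interpolate(
        images, size=(256, 256), mode="bilinear", align_corners=False
    )
    return (images.clamp(0, 1) * 255).to(torch.uint8)

# Dummy stand-ins for COCO validation images and model generations;
# real runs would use the full validation set.
real = to_uint8_256(torch.rand(8, 3, 480, 640))
fake = to_uint8_256(torch.rand(8, 3, 512, 512))

fid.update(real, real=True)
fid.update(fake, real=False)
print(float(fid.compute()))
```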
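Finally, since this report only names the hyperparameter fields of the paper's Table 6 rather than reproducing their values, the dataclass below records those fields as a configuration skeleton. Every value is a placeholder for illustration, not the paper's actual setting; consult Table 6 of the paper for the real numbers.

```python
from dataclasses import dataclass

# Skeleton of the pre-training configuration fields that Table 6 reports.
# Field names follow the report's summary; ALL values are PLACEHOLDERS,
# not the paper's settings.
@dataclass
class PretrainConfig:
    batch_size: int = 8192          # placeholder
    learning_rate: float = 1e-4     # placeholder
    weight_decay: float = 0.05      # placeholder
    input_token_budget: int = 128   # placeholder
    target_token_budget: int = 128  # placeholder
    image_resolution: int = 224     # placeholder

cfg = PretrainConfig()
print(cfg)
```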