4M: Massively Multimodal Masked Modeling
Authors: David Mizrahi, Roman Bachmann, Oğuzhan Fatih Kar, Teresa Yeo, Mingfei Gao, Afshin Dehghan, Amir Zamir
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through experimental analyses, we demonstrate the potential of 4M for training versatile and scalable foundation models for vision tasks, setting the stage for further exploration in multimodal learning for vision and other domains. |
| Researcher Affiliation | Collaboration | 1Swiss Federal Institute of Technology Lausanne (EPFL) 2Apple |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | Code, models, and additional interactive visualizations are available at https://4m.epfl.ch. |
| Open Datasets | Yes | We resort to pseudo labeling [42, 5] the publicly available Conceptual Captions 12M (CC12M) [19] as a binding dataset... The transfer tasks include ImageNet-1K classification [29, 96], COCO detection and instance segmentation [67], ADE20K semantic segmentation [131], and NYUv2 depth estimation [102]. |
| Dataset Splits | Yes | For Taskonomy-20K, we transfer to depth, principal curvature, reshading, occlusion edges, 2D edges, 2D keypoints, and 3D keypoints. For Hypersim, we transfer to surface normals, semantic segmentation, and 2D edges. ... Taskonomy-20K contains 20,000 training images and 2,000 validation images. |
| Hardware Specification | Yes | 4M-B was trained in 1.5 days on 64 A100 GPUs, 4M-L was trained in 3 days on 128 A100 GPUs, while training 4M-XL took 8 days on 128 A100 GPUs. |
| Software Dependencies | No | The paper mentions using AdamW and PyTorch (implied through its use of FSDP), but does not provide version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | The training details for the 4M models used for the transfer experiments (Section 3) and the generation results (Section 4) are shown in Table 5... These adjustments include reducing the training resolution when fine-tuning on COCO, and reducing the number of epochs for intermediate fine-tuning on ImageNet-21K. |
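The hardware figures above can be turned into total compute budgets with simple arithmetic. The sketch below derives A100-hours from the reported wall-clock days and GPU counts; the totals are computed here, not numbers stated in the paper.

```python
# Back-of-the-envelope compute totals from the reported training budgets
# (Hardware Specification row). A100-hour totals are derived figures.

def gpu_hours(days: float, num_gpus: int) -> float:
    """Total GPU-hours for a run of `days` wall-clock days on `num_gpus` GPUs."""
    return days * 24 * num_gpus

runs = {
    "4M-B":  (1.5, 64),   # 1.5 days on 64 A100 GPUs
    "4M-L":  (3.0, 128),  # 3 days on 128 A100 GPUs
    "4M-XL": (8.0, 128),  # 8 days on 128 A100 GPUs
}

for model, (days, gpus) in runs.items():
    print(f"{model}: {gpu_hours(days, gpus):,.0f} A100-hours")
# 4M-B: 2,304 | 4M-L: 9,216 | 4M-XL: 24,576
```

By this estimate, the largest run (4M-XL) uses roughly ten times the compute of the base model.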