4M: Massively Multimodal Masked Modeling

Authors: David Mizrahi, Roman Bachmann, Oğuzhan Fatih Kar, Teresa Yeo, Mingfei Gao, Afshin Dehghan, Amir Zamir

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Through experimental analyses, we demonstrate the potential of 4M for training versatile and scalable foundation models for vision tasks, setting the stage for further exploration in multimodal learning for vision and other domains. |
| Researcher Affiliation | Collaboration | ¹Swiss Federal Institute of Technology Lausanne (EPFL), ²Apple |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | Code, models, and additional interactive visualizations are available at https://4m.epfl.ch. |
| Open Datasets | Yes | We resort to pseudo labeling [42, 5] the publicly available Conceptual Captions 12M (CC12M) [19] as a binding dataset... The transfer tasks include ImageNet-1K classification [29, 96], COCO detection and instance segmentation [67], ADE20K semantic segmentation [131], and NYUv2 depth estimation [102]. |
| Dataset Splits | Yes | For Taskonomy-20K, we transfer to depth, principal curvature, reshading, occlusion edges, 2D edges, 2D keypoints, and 3D keypoints. For Hypersim, we transfer to surface normals, semantic segmentation, and 2D edges. ... Taskonomy-20K contains 20,000 training images and 2,000 validation images. |
| Hardware Specification | Yes | 4M-B was trained in 1.5 days on 64 A100 GPUs, 4M-L was trained in 3 days on 128 A100 GPUs, while training 4M-XL took 8 days on 128 A100 GPUs. |
| Software Dependencies | No | The paper mentions using AdamW and PyTorch (implied by its use of FSDP), but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | The training details for the 4M models used for the transfer experiments (Section 3) and the generation results (Section 4) are shown in Table 5... These adjustments include reducing the training resolution when fine-tuning on COCO, and reducing the number of epochs for intermediate fine-tuning on ImageNet-21K. |
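
For readers estimating reproduction cost, the figures quoted in the Hardware Specification row imply the following per-model compute; the tally below is our arithmetic on the quoted numbers, not a figure reported by the paper:

```latex
% GPU-days = training days x number of GPUs, from the Hardware Specification row
\begin{align*}
\text{4M-B:}  \quad & 1.5 \times 64  = 96   \ \text{A100 GPU-days} \\
\text{4M-L:}  \quad & 3   \times 128 = 384  \ \text{A100 GPU-days} \\
\text{4M-XL:} \quad & 8   \times 128 = 1024 \ \text{A100 GPU-days}
\end{align*}
```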
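The Software Dependencies row notes that the paper names AdamW and implies PyTorch through its use of FSDP, without pinning versions. For orientation, here is a minimal sketch of what an FSDP-wrapped training step with AdamW typically looks like in PyTorch; the model, hyperparameters, dummy data, and launch configuration are illustrative assumptions, not the authors' training code.

```python
# Minimal sketch of a PyTorch FSDP + AdamW setup, as implied by the
# Software Dependencies row. Everything below the imports is an
# illustrative assumption, not the 4M training code.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    # One process per GPU, e.g. launched with `torchrun --nproc_per_node=8 train.py`.
    dist.init_process_group(backend="nccl")
    local_rank = dist.get_rank() % torch.cuda.device_count()
    torch.cuda.set_device(local_rank)

    # Placeholder encoder-decoder model; 4M's actual architecture differs.
    model = torch.nn.Transformer(d_model=768, nhead=12).cuda()
    # FSDP shards parameters, gradients, and optimizer state across ranks.
    model = FSDP(model)

    # AdamW is named in the paper; these hyperparameters are assumptions.
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.05)

    # One training step on random tensors standing in for tokenized inputs.
    src = torch.randn(128, 4, 768, device="cuda")  # (seq_len, batch, d_model)
    tgt = torch.randn(32, 4, 768, device="cuda")
    loss = model(src, tgt).pow(2).mean()  # dummy loss for illustration
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Sharding optimizer state this way is what makes AdamW practical at the 128-GPU scale quoted in the Hardware Specification row, which is presumably why FSDP appears in the paper at all.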