4M: Massively Multimodal Masked Modeling

Authors: David Mizrahi, Roman Bachmann, Oğuzhan Fatih Kar, Teresa Yeo, Mingfei Gao, Afshin Dehghan, Amir Zamir

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Through experimental analyses, we demonstrate the potential of 4M for training versatile and scalable foundation models for vision tasks, setting the stage for further exploration in multimodal learning for vision and other domains. |
| Researcher Affiliation | Collaboration | ¹Swiss Federal Institute of Technology Lausanne (EPFL), ²Apple |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | Code, models, and additional interactive visualizations are available at https://4m.epfl.ch. |
| Open Datasets | Yes | We resort to pseudo labeling [42, 5] the publicly available Conceptual Captions 12M (CC12M) [19] as a binding dataset... The transfer tasks include ImageNet-1K classification [29, 96], COCO detection and instance segmentation [67], ADE20K semantic segmentation [131], and NYUv2 depth estimation [102]. |
| Dataset Splits | Yes | For Taskonomy-20K, we transfer to depth, principal curvature, reshading, occlusion edges, 2D edges, 2D keypoints, and 3D keypoints. For Hypersim, we transfer to surface normals, semantic segmentation, and 2D edges. ... Taskonomy-20K contains 20,000 training images and 2,000 validation images. |
| Hardware Specification | Yes | 4M-B was trained in 1.5 days on 64 A100 GPUs, 4M-L was trained in 3 days on 128 A100 GPUs, while training 4M-XL took 8 days on 128 A100 GPUs. |
| Software Dependencies | No | The paper mentions using AdamW and PyTorch (implied by its use of FSDP), but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | The training details for the 4M models used for the transfer experiments (Section 3) and the generation results (Section 4) are shown in Table 5... These adjustments include reducing the training resolution when fine-tuning on COCO, and reducing the number of epochs for intermediate fine-tuning on ImageNet-21K. |
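
For readers estimating reproduction cost, the figures quoted in the Hardware Specification row imply the following per-model compute; the tally below is our arithmetic on the quoted numbers, not a figure reported by the paper:

```latex
% GPU-days = training days x number of GPUs, from the Hardware Specification row
\begin{align*}
\text{4M-B:}  \quad & 1.5 \times 64  = 96   \ \text{A100 GPU-days} \\
\text{4M-L:}  \quad & 3   \times 128 = 384  \ \text{A100 GPU-days} \\
\text{4M-XL:} \quad & 8   \times 128 = 1024 \ \text{A100 GPU-days}
\end{align*}
```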
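The Software Dependencies row notes that the paper names AdamW and implies PyTorch through its use of FSDP, without pinning versions. For orientation, here is a minimal sketch of what an FSDP-wrapped training step with AdamW typically looks like in PyTorch; the model, hyperparameters, dummy data, and launch configuration are illustrative assumptions, not the authors' training code.

```python
# Minimal sketch of a PyTorch FSDP + AdamW setup, as implied by the
# Software Dependencies row. Everything below the imports is an
# illustrative assumption, not the 4M training code.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    # One process per GPU, e.g. launched with `torchrun --nproc_per_node=8 train.py`.
    dist.init_process_group(backend="nccl")
    local_rank = dist.get_rank() % torch.cuda.device_count()
    torch.cuda.set_device(local_rank)

    # Placeholder encoder-decoder model; 4M's actual architecture differs.
    model = torch.nn.Transformer(d_model=768, nhead=12).cuda()
    # FSDP shards parameters, gradients, and optimizer state across ranks.
    model = FSDP(model)

    # AdamW is named in the paper; these hyperparameters are assumptions.
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.05)

    # One training step on random tensors standing in for tokenized inputs.
    src = torch.randn(128, 4, 768, device="cuda")  # (seq_len, batch, d_model)
    tgt = torch.randn(32, 4, 768, device="cuda")
    loss = model(src, tgt).pow(2).mean()  # dummy loss for illustration
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Sharding optimizer state this way is what makes AdamW practical at the 128-GPU scale quoted in the Hardware Specification row, which is presumably why FSDP appears in the paper at all.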