Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning
Authors: Chenyu Yang, Xizhou Zhu, Jinguo Zhu, Weijie Su, Junjie Wang, Xuan Dong, Wenhai Wang, Bin Li, Jie Zhou, Yu Qiao, Jifeng Dai
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments demonstrate that our method not only matches the performance of CLIP on paired pre-training datasets (e.g., LAION), but can also leverage interleaved pre-training data (e.g., MMC4) to learn robust visual representations from scratch, showcasing the potential of vision model pre-training with interleaved image-text data. |
| Researcher Affiliation | Collaboration | 1 Tsinghua University; 2 OpenGVLab, Shanghai AI Laboratory; 3 SenseTime Research; 4 The Chinese University of Hong Kong; 5 University of Science and Technology of China; 6 Xi'an Jiaotong University; 7 Beijing University of Posts and Telecommunications |
| Pseudocode | No | The paper includes figures illustrating architectural components and mathematical formulations, but it does not contain any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | Code is released at https://github.com/OpenGVLab/LCL. |
| Open Datasets | Yes | The datasets utilized in our pre-training encompass the image-text pair dataset LAION-400M [57], as well as the image-text interleaved datasets MMC4 [88] and OBELICS [36]. |
| Dataset Splits | Yes | Model is trained on the ImageNet-1K [33] train split and evaluated on the val split. Image-text retrieval... trained on a combination dataset comprised of CC12M [61], CC3M [61], and SBU [77], and is tested on the MSCOCO [11] Karpathy test split and Flickr30k [54] test split. Model is trained on a subset of the LAION-COCO [58] dataset... and evaluation is performed on the MSCOCO [11] Karpathy test split and NoCaps [1] val split. |
| Hardware Specification | Yes | Pre-training used 512 A800 GPUs and took 5 days. |
| Software Dependencies | No | The paper mentions using the 'AdamW optimizer' and 'Mixed numerical precision training with bfloat16' but does not specify version numbers for key software dependencies or libraries used for implementation (e.g., PyTorch, TensorFlow, CUDA). |
| Experiment Setup | Yes | Our pre-training configuration is shown in Tab. 7. The AdamW optimizer was employed for model training with the learning rate set to 3e-4 and the weight decay set to 0.1. Mixed numerical precision training with bfloat16 is also employed to stabilize the optimization process. Furthermore, we set a drop-path [35] rate linearly increasing to 0.2, and use layer-scale [73] for stable training. (A hedged code sketch of this configuration follows the table.) |
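
For reference, the following is a minimal sketch of the optimizer and mixed-precision setup quoted in the Experiment Setup row (AdamW, learning rate 3e-4, weight decay 0.1, bfloat16 autocast), assuming PyTorch. The model, loss function, and data are hypothetical placeholders, not the authors' method; the released code at https://github.com/OpenGVLab/LCL is authoritative for the actual pre-training recipe, including the ViT drop-path and layer-scale details.

```python
# Sketch of the reported optimization setup: AdamW (lr 3e-4, weight decay 0.1)
# with bfloat16 mixed-precision autocast. The model, loss, and data below are
# placeholders; the paper's actual encoder is a ViT with drop-path (rate up to
# 0.2) and layer-scale, trained with the LCL objective.
import torch
from torch import nn

device = "cuda" if torch.cuda.is_available() else "cpu"

# Placeholder stand-in for the vision encoder and pre-training head.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 1024)).to(device)

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
loss_fn = nn.CrossEntropyLoss()  # placeholder objective, not the LCL loss

def train_step(images: torch.Tensor, targets: torch.Tensor) -> float:
    optimizer.zero_grad(set_to_none=True)
    # bfloat16 autocast stabilizes mixed-precision training without loss scaling.
    with torch.autocast(device_type=device, dtype=torch.bfloat16):
        loss = loss_fn(model(images), targets)
    loss.backward()
    optimizer.step()
    return loss.item()

# Example usage with random ImageNet-shaped inputs.
images = torch.randn(8, 3, 224, 224, device=device)
targets = torch.randint(0, 1024, (8,), device=device)
print(train_step(images, targets))
```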