Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning
Authors: Chenyu Yang, Xizhou Zhu, Jinguo Zhu, Weijie Su, Junjie Wang, Xuan Dong, Wenhai Wang, Bin Li, Jie Zhou, Yu Qiao, Jifeng Dai
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments demonstrate that our method not only matches the performance of CLIP on paired pre-training datasets (e.g., LAION), but can also leverage interleaved pre-training data (e.g., MMC4) to learn robust visual representations from scratch, showcasing the potential of vision model pre-training with interleaved image-text data. |
| Researcher Affiliation | Collaboration | 1 Tsinghua University; 2 OpenGVLab, Shanghai AI Laboratory; 3 SenseTime Research; 4 The Chinese University of Hong Kong; 5 University of Science and Technology of China; 6 Xi'an Jiaotong University; 7 Beijing University of Posts and Telecommunications |
| Pseudocode | No | The paper includes figures illustrating architectural components and mathematical formulations, but it does not contain any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | Code is released at https://github.com/OpenGVLab/LCL. |
| Open Datasets | Yes | The datasets utilized in our pre-training encompass the image-text pair dataset LAION-400M [57], as well as the image-text interleaved datasets MMC4 [88] and OBELICS [36]. |
| Dataset Splits | Yes | Model is trained on the ImageNet-1K [33] train split and evaluated on the val split. Image-text retrieval... trained on a combination dataset comprised of CC12M [61], CC3M [61], and SBU [77], and is tested on the MSCOCO [11] Karpathy test split and Flickr30k [54] test split. Model is trained on a subset of the LAION-COCO [58] dataset... and evaluation is performed on the MSCOCO [11] Karpathy test split and NoCaps [1] val split. |
| Hardware Specification | Yes | Pre-training used 512 A800 GPUs and took 5 days. |
| Software Dependencies | No | The paper mentions using the 'AdamW optimizer' and 'Mixed numerical precision training with bfloat16' but does not specify version numbers for key software dependencies or libraries used for implementation (e.g., PyTorch, TensorFlow, CUDA). |
| Experiment Setup | Yes | Our pre-training configuration is shown in Tab. 7. The AdamW optimizer was employed for model training with the learning rate set to 3e-4 and the weight decay set to 0.1. Mixed numerical precision training with bfloat16 is also employed to stabilize the optimization process. Furthermore, we set a drop-path [35] rate linearly increasing to 0.2, and use layer-scale [73] for stable training. (A hedged code sketch of this configuration follows the table.) |
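
For reference, the following is a minimal sketch of the optimizer and mixed-precision setup quoted in the Experiment Setup row (AdamW, learning rate 3e-4, weight decay 0.1, bfloat16 autocast), assuming PyTorch. The model, loss function, and data are hypothetical placeholders, not the authors' method; the released code at https://github.com/OpenGVLab/LCL is authoritative for the actual pre-training recipe, including the ViT drop-path and layer-scale details.

```python
# Sketch of the reported optimization setup: AdamW (lr 3e-4, weight decay 0.1)
# with bfloat16 mixed-precision autocast. The model, loss, and data below are
# placeholders; the paper's actual encoder is a ViT with drop-path (rate up to
# 0.2) and layer-scale, trained with the LCL objective.
import torch
from torch import nn

device = "cuda" if torch.cuda.is_available() else "cpu"

# Placeholder stand-in for the vision encoder and pre-training head.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 1024)).to(device)

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
loss_fn = nn.CrossEntropyLoss()  # placeholder objective, not the LCL loss

def train_step(images: torch.Tensor, targets: torch.Tensor) -> float:
    optimizer.zero_grad(set_to_none=True)
    # bfloat16 autocast stabilizes mixed-precision training without loss scaling.
    with torch.autocast(device_type=device, dtype=torch.bfloat16):
        loss = loss_fn(model(images), targets)
    loss.backward()
    optimizer.step()
    return loss.item()

# Example usage with random ImageNet-shaped inputs.
images = torch.randn(8, 3, 224, 224, device=device)
targets = torch.randint(0, 1024, (8,), device=device)
print(train_step(images, targets))
```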