Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization
Authors: Yang Jin, Kun Xu, Kun Xu, Liwei Chen, Chao Liao, Jianchao Tan, Quzhe Huang, Bin Chen, Chengru Song, Dai Meng, Di Zhang, Wenwu Ou, Kun Gai, Yadong Mu
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments further showcase that it outperforms the existing models by a large margin on massive vision-language tasks. Our code and models are available at https://github.com/jy0205/LaVIT. (Section 4, Experiments:) In this section, comprehensive experiments are conducted to systematically validate the effectiveness of LaVIT on a wide range of vision-language tasks. |
| Researcher Affiliation | Collaboration | Yang Jin1, Kun Xu2, Kun Xu2, Liwei Chen2, Chao Liao2, Jianchao Tan2, Quzhe Huang1, Bin Chen2, Chenyi Lei2, An Liu2, Chengru Song2, Xiaoqiang Lei2, Di Zhang2, Wenwu Ou2, Kun Gai2, Yadong Mu1 (1: Peking University, 2: Kuaishou Technology) |
| Pseudocode | No | The paper includes figures illustrating the model architecture and data flow, but no formal pseudocode or algorithm blocks are provided. |
| Open Source Code | Yes | Our code and models are available at https://github.com/jy0205/LaVIT. |
| Open Datasets | Yes | It is trained for 50K steps on about 100M images from LAION400M (Schuhmann et al., 2021) with the batch size of 2048 and ρ = 1/3. For image-to-text comprehension (i.e., [image, text]), we employ about 93M samples from Conceptual Caption (Sharma et al., 2018; Changpinyo et al., 2021), SBU (Ordonez et al., 2011), and BLIP-Capfilt (Li et al., 2022). For the text-to-image synthesis (i.e., [text, image]), an additional 100M image-text pairs from the LAION-Aesthetics (a high-aesthetics image subset of LAION-5B (Schuhmann et al., 2022)) are used following Stable Diffusion. Moreover, to reduce catastrophic forgetting of the reasoning capacity in training LLM, we employ the English text corpus from Redpajama (Computer, 2023) dataset and mix it with the above image-text pairs to form the multi-modal input sequence. (A hedged data-mixing sketch follows the table.) |
| Dataset Splits | Yes | We first quantitatively evaluate the model's zero-shot text-conditional image synthesis performance on the validation set of the MS-COCO benchmark (Lin et al., 2014). |
| Hardware Specification | Yes | GPU Usage: 256 NVIDIA A100 (VL Pre-training) / 64 NVIDIA A100 (Tokenizer training) |
| Software Dependencies | No | The paper mentions using LLaMA (Touvron et al., 2023) and Stable Diffusion v1.5 (Rombach et al., 2022) as baselines or components for initialization. However, it does not specify software dependencies with version numbers for core libraries or environments such as Python, PyTorch, or TensorFlow. |
| Experiment Setup | Yes | Table 9: The detailed training hyperparameters of LaVIT. Visual Encoder: EVA-CLIP ViT-G/14; LLM init: LLaMA-1-7B; Optimizer: AdamW; Optimizer hyperparameters: β1 = 0.9, β2 = 0.95, ϵ = 1e-6 (VL Pre-training) / β1 = 0.9, β2 = 0.99, ϵ = 1e-6 (Tokenizer training); Global batch size: 2048; Peak learning rate of LLM: 5e-5; Peak learning rate of visual part: 1.5e-4 (VL Pre-training) / 2e-4 (Tokenizer training); Learning rate schedule: cosine; Training steps: 20K (VL Pre-training) / 50K (Tokenizer training); Warm-up steps: 2K (VL Pre-training) / 4K (Tokenizer training); Weight decay: 0.1 (VL Pre-training) / 0.01 (Tokenizer training); Gradient clipping: 1.0; Input image resolution: 224 × 224; Input sequence length to LLM: 2048; Numerical precision: bfloat16. (A hedged configuration sketch follows the table.) |
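The Experiment Setup row above packs the whole of Table 9 into one cell. As a reading aid, the same values are transcribed below into a plain Python dictionary. This is a minimal sketch: the key names (e.g. `peak_lr_llm`, `llm_sequence_length`) are illustrative and do not come from the LaVIT codebase; only the values are taken from the paper.

```python
# Hedged transcription of LaVIT's Table 9 hyperparameters.
# Key names are illustrative; values are the ones reported in the paper,
# with each "(VL Pre-training) / (Tokenizer training)" pair kept as a nested dict.

LAVIT_TRAINING_CONFIG = {
    "visual_encoder": "EVA-CLIP ViT-G/14",
    "llm_init": "LLaMA-1-7B",
    "optimizer": "AdamW",
    "optimizer_hparams": {
        "vl_pretraining": {"beta1": 0.9, "beta2": 0.95, "eps": 1e-6},
        "tokenizer_training": {"beta1": 0.9, "beta2": 0.99, "eps": 1e-6},
    },
    "global_batch_size": 2048,
    "peak_lr_llm": 5e-5,
    "peak_lr_visual": {"vl_pretraining": 1.5e-4, "tokenizer_training": 2e-4},
    "lr_schedule": "cosine",
    "training_steps": {"vl_pretraining": 20_000, "tokenizer_training": 50_000},
    "warmup_steps": {"vl_pretraining": 2_000, "tokenizer_training": 4_000},
    "weight_decay": {"vl_pretraining": 0.1, "tokenizer_training": 0.01},
    "gradient_clipping": 1.0,
    "input_image_resolution": (224, 224),
    "llm_sequence_length": 2048,
    "precision": "bfloat16",
}

if __name__ == "__main__":
    # Print the recipe so the two training stages can be compared at a glance.
    for key, value in LAVIT_TRAINING_CONFIG.items():
        print(f"{key}: {value}")
```

Keeping the two stages as nested entries mirrors the paper's "(VL Pre-training) / (Tokenizer training)" notation and makes it easy to diff the two recipes.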
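The Open Datasets row above describes three kinds of training data that are mixed into one multi-modal stream: [image, text] pairs for comprehension, [text, image] pairs for synthesis, and text-only Redpajama documents added to limit catastrophic forgetting of the LLM. The sketch below shows one way such a mixed stream could be sampled; the sampling weights, special tokens, and function names are assumptions for illustration, not the authors' exact recipe.

```python
# Hedged sketch of assembling a mixed multi-modal training stream from the
# three data sources named in the paper: [image, text] captioning pairs,
# [text, image] synthesis pairs, and text-only Redpajama documents.
# Weights, markers, and names below are illustrative assumptions.

import random


def sample_training_sequence(rng: random.Random) -> list[str]:
    """Return a toy token-level layout for one training example."""
    source = rng.choices(
        population=["image_to_text", "text_to_image", "text_only"],
        weights=[0.4, 0.4, 0.2],  # assumed mixing ratio, not from the paper
        k=1,
    )[0]
    if source == "image_to_text":
        # Comprehension: visual tokens first, then the caption tokens.
        return ["<img>", "...visual tokens...", "</img>", "...caption text..."]
    if source == "text_to_image":
        # Synthesis: caption tokens first, then the visual tokens to predict.
        return ["...caption text...", "<img>", "...visual tokens...", "</img>"]
    # Text-only corpus mixed in to preserve the LLM's reasoning ability.
    return ["...Redpajama text..."]


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        print(sample_training_sequence(rng))
```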