Making LLaMA SEE and Draw with SEED Tokenizer
Authors: Yuying Ge, Sijie Zhao, Ziyun Zeng, Yixiao Ge, Chen Li, Xintao Wang, Ying Shan
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 4 EXPERIMENT Evaluation of Causal Embeddings. We evaluate the performance of Causal Q-Former on the image-text retrieval using COCO (Lin et al., 2014) and Flickr30K (Young et al., 2014). The performance is measured by Recall@K (R@K). ... We evaluate SEED-LLaMA on a wide range of multimodal comprehension tasks including image captioning and image/video question answering. (A minimal Recall@K sketch follows the table.) |
| Researcher Affiliation | Industry | Yuying Ge1, Sijie Zhao1, Ziyun Zeng2, Yixiao Ge1,2, Chen Li2, Xintao Wang1,2, Ying Shan1,2 (1Tencent AI Lab, 2ARC Lab, Tencent PCG) |
| Pseudocode | No | The paper uses mathematical equations and diagrams to describe processes but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code (training and inference) and models are released in https://github.com/AILab-CVC/SEED. |
| Open Datasets | Yes | We pre-train SEED tokenizer on image-text pairs including CC3M (Sharma et al., 2018), Unsplash (Luke Chesser, 2023), LAION-COCO (Christoph et al., 2022) and MS-COCO (Chen et al., 2015). We use a large-scale dataset WebVid-10M (Bain et al., 2021) containing videos and captions. We use publicly available MMC4 (Zhu et al., 2023b) and OBELISC (Laurençon et al., 2023) datasets. |
| Dataset Splits | Yes | We evaluate SEED-LLaMA on a wide range of multimodal comprehension tasks including image captioning and image/video question answering. Details of these benchmarks and evaluation metrics are provided in Appendix D. As shown in Tab. 8: VQAv2 (Goyal et al., 2017), Scene Understanding QA, test-dev split, VQA acc. (↑); OKVQA (Marino et al., 2019), External Knowledge QA, val split, VQA acc. (↑); VizWiz (Gurari et al., 2018), Scene Understanding QA, test-dev split, VQA acc. (↑). |
| Hardware Specification | Yes | We perform pretraining using two versions of LLM, Vicuna-7B and Llama2-chat-13B, with 64 A100-40G GPUs, and yield SEED-LLaMA-8B (144 hours) and SEED-LLaMA-14B (216 hours), respectively. ... The overall instruction tuning phase takes 16 hours for SEED-LLaMA-8B and 27 hours for SEED-LLaMA-14B with 32 A100-80G GPUs. |
| Software Dependencies | No | The paper does not provide specific version numbers for software dependencies or libraries used in the experiments. |
| Experiment Setup | Yes | Table 5: Summary of pretraining hyperparameters of SEED-LLaMA. Configuration: SEED 8B / SEED 14B ... Peak learning rate 1.5e-4; Warmup ratio 0.03; LR schedule Cosine decay; Optimizer AdamW; Optimizer hyperparameters β1, β2, ε = 0.9, 0.98, 1e-6; Image resolution 224×224; Weight decay 0.05; Iterations 30k + 10k. (See the optimizer/schedule sketch after this table.) |
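
The hyperparameters quoted in the Experiment Setup row (AdamW with β1, β2, ε = 0.9, 0.98, 1e-6, peak learning rate 1.5e-4, weight decay 0.05, warmup ratio 0.03, cosine decay) map onto a standard PyTorch optimizer and scheduler as sketched below. The placeholder model, the single-stage step count, and the LambdaLR-based warmup-plus-cosine schedule are assumptions for illustration, not code from the released SEED repository.

```python
import math
import torch

model = torch.nn.Linear(4096, 4096)  # placeholder for the SEED-LLaMA parameters being trained

total_steps = 30_000                     # "Iterations 30k + 10k"; only the first stage is modeled here
warmup_steps = int(0.03 * total_steps)   # warmup ratio 0.03

# AdamW with the hyperparameters quoted from Table 5.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1.5e-4,            # peak learning rate
    betas=(0.9, 0.98),
    eps=1e-6,
    weight_decay=0.05,
)

def lr_lambda(step: int) -> float:
    """Linear warmup to the peak LR, then cosine decay toward zero."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```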
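
The retrieval evaluation quoted in the Research Type row reports Recall@K (R@K) on COCO and Flickr30K. For reference, the sketch below shows one common way to compute R@K from an image-text similarity matrix; the function name, the dense similarity matrix, and the one-positive-per-query assumption are illustrative and not taken from the paper's evaluation code.

```python
import numpy as np

def recall_at_k(similarity: np.ndarray, k: int) -> float:
    """Recall@K from a [num_queries, num_gallery] similarity matrix.

    Assumes query i's ground-truth gallery item sits at index i
    (a single positive per query, as in standard COCO/Flickr30K retrieval).
    """
    # Indices of the top-k most similar gallery items for each query.
    top_k = np.argsort(-similarity, axis=1)[:, :k]
    # A query is a hit if its ground-truth index appears among its top-k results.
    hits = (top_k == np.arange(similarity.shape[0])[:, None]).any(axis=1)
    return float(hits.mean())

# Toy usage: 3 queries, 5 gallery items, ground truth on the diagonal.
sim = np.random.default_rng(0).normal(size=(3, 5))
print(recall_at_k(sim, k=1), recall_at_k(sim, k=5))
```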