Making LLaMA SEE and Draw with SEED Tokenizer

Authors: Yuying Ge, Sijie Zhao, Ziyun Zeng, Yixiao Ge, Chen Li, Xintao Wang, Ying Shan

ICLR 2024

Reproducibility variables, each listed with the assessed result and the supporting LLM response (excerpt from the paper):
Research Type: Experimental. "We evaluate the performance of Causal Q-Former on the image-text retrieval using COCO (Lin et al., 2014) and Flickr30K (Young et al., 2014). The performance is measured by Recall@K (R@K). ... We evaluate SEED-LLaMA on a wide range of multimodal comprehension tasks including image captioning and image/video question answering."
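The retrieval experiments report Recall@K, the fraction of queries whose ground-truth match appears among the top-K retrieved gallery items. As a reference for the metric only (a minimal sketch, not the authors' evaluation code), assuming a precomputed query-gallery similarity matrix:

```python
import numpy as np

def recall_at_k(sim: np.ndarray, gt: np.ndarray, k: int) -> float:
    """Recall@K for retrieval.

    sim: (num_queries, num_gallery) similarity matrix.
    gt:  (num_queries,) index of the correct gallery item for each query.
    Returns the fraction of queries whose ground truth is in the top-k.
    """
    # Indices of the k most similar gallery items per query.
    topk = np.argsort(-sim, axis=1)[:, :k]
    hits = (topk == gt[:, None]).any(axis=1)
    return float(hits.mean())

# Toy usage: 3 queries, 5 gallery items.
sim = np.random.default_rng(0).normal(size=(3, 5))
gt = np.array([0, 3, 4])
print(recall_at_k(sim, gt, k=1), recall_at_k(sim, gt, k=5))
```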
Researcher Affiliation: Industry. "Yuying Ge [1], Sijie Zhao [1], Ziyun Zeng [2], Yixiao Ge [1,2], Chen Li [2], Xintao Wang [1,2], Ying Shan [1,2]. [1] Tencent AI Lab, [2] ARC Lab, Tencent PCG."
Pseudocode: No. The paper uses mathematical equations and diagrams to describe processes but does not include structured pseudocode or algorithm blocks.
Open Source Code: Yes. "The code (training and inference) and models are released in https://github.com/AILab-CVC/SEED."
Open Datasets: Yes. "We pre-train SEED tokenizer on image-text pairs including CC3M (Sharma et al., 2018), Unsplash (Luke Chesser, 2023), LAION-COCO (Christoph et al., 2022) and MS-COCO (Chen et al., 2015). We use a large-scale dataset WebVid-10M (Bain et al., 2021) containing videos and captions. We use publicly available MMC4 (Zhu et al., 2023b) and OBELISC (Laurençon et al., 2023) datasets."
Dataset Splits: Yes. "We evaluate SEED-LLaMA on a wide range of multimodal comprehension tasks including image captioning and image/video question answering. Details of these benchmarks and evaluation metrics are provided in Appendix D. As shown in Tab. 8: VQAv2 (Goyal et al., 2017), Scene Understanding QA, test-dev split, VQA acc. (↑); OKVQA (Marino et al., 2019), External Knowledge QA, val split, VQA acc. (↑); VizWiz (Gurari et al., 2018), Scene Understanding QA, test-dev split, VQA acc. (↑)."
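These benchmarks use the standard VQA accuracy, which scores a predicted answer against the ten human annotations as acc = min(#matching annotations / 3, 1). A minimal sketch of that metric (the official evaluation additionally normalizes punctuation and articles, which is omitted here):

```python
def vqa_accuracy(pred: str, human_answers: list[str]) -> float:
    """Standard VQA accuracy: an answer scores 100% if at least 3 of
    the human annotators gave it, and matches/3 otherwise."""
    matches = sum(a.strip().lower() == pred.strip().lower() for a in human_answers)
    return min(matches / 3.0, 1.0)

print(vqa_accuracy("cat", ["cat"] * 6 + ["dog"] * 4))  # 1.0
print(vqa_accuracy("dog", ["cat"] * 9 + ["dog"]))      # ~0.33
```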
Hardware Specification: Yes. "We perform pretraining using two versions of LLM, Vicuna-7B and Llama2-chat-13B, with 64 A100-40G GPUs, and yield SEED-LLaMA-8B (144 hours) and SEED-LLaMA-14B (216 hours), respectively. ... The overall instruction tuning phase takes 16 hours for SEED-LLaMA-8B and 27 hours for SEED-LLaMA-14B with 32 A100-80G GPUs."
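For cost estimation, the reported settings translate into total GPU-hours as follows (a back-of-the-envelope calculation from the quoted figures, not a number stated in the paper):

```python
# Pretraining: 64 x A100-40G.
pretrain_8b  = 64 * 144   # 9,216 GPU-hours for SEED-LLaMA-8B
pretrain_14b = 64 * 216   # 13,824 GPU-hours for SEED-LLaMA-14B
# Instruction tuning: 32 x A100-80G.
tune_8b  = 32 * 16        # 512 GPU-hours
tune_14b = 32 * 27        # 864 GPU-hours
print(pretrain_8b + tune_8b, pretrain_14b + tune_14b)  # 9728, 14688
```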
Software Dependencies No The paper does not provide specific version numbers for software dependencies or libraries used in the experiments.
Experiment Setup: Yes. "Table 5: Summary of pretraining hyperparameters of SEED-LLaMA." Configuration (SEED 8B / SEED 14B): peak learning rate 1.5e-4; warmup ratio 0.03; LR schedule cosine decay; optimizer AdamW (β1, β2, ϵ = 0.9, 0.98, 1e-6); image resolution 224×224; weight decay 0.05; iterations 30k + 10k.
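The listed schedule (peak LR 1.5e-4, warmup ratio 0.03, cosine decay) corresponds to the common linear-warmup-then-cosine curve. A minimal sketch under those assumptions; the floor learning rate and exact step accounting in the released code may differ:

```python
import math

def lr_at_step(step: int, total_steps: int, peak_lr: float = 1.5e-4,
               warmup_ratio: float = 0.03, min_lr: float = 0.0) -> float:
    """Linear warmup for warmup_ratio * total_steps, then cosine decay to min_lr."""
    warmup_steps = int(warmup_ratio * total_steps)
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

total = 30_000  # first pretraining stage from Table 5
for s in (0, 900, 15_000, 29_999):
    print(s, f"{lr_at_step(s, total):.2e}")
```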