Making LLaMA SEE and Draw with SEED Tokenizer

Authors: Yuying Ge, Sijie Zhao, Ziyun Zeng, Yixiao Ge, Chen Li, Xintao Wang, Ying Shan

ICLR 2024

Reproducibility variables, each listed with the assessed result and the supporting LLM response (excerpt from the paper):
Research Type: Experimental. "We evaluate the performance of Causal Q-Former on the image-text retrieval using COCO (Lin et al., 2014) and Flickr30K (Young et al., 2014). The performance is measured by Recall@K (R@K). ... We evaluate SEED-LLaMA on a wide range of multimodal comprehension tasks including image captioning and image/video question answering."
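The retrieval experiments report Recall@K, the fraction of queries whose ground-truth match appears among the top-K retrieved gallery items. As a reference for the metric only (a minimal sketch, not the authors' evaluation code), assuming a precomputed query-gallery similarity matrix:

```python
import numpy as np

def recall_at_k(sim: np.ndarray, gt: np.ndarray, k: int) -> float:
    """Recall@K for retrieval.

    sim: (num_queries, num_gallery) similarity matrix.
    gt:  (num_queries,) index of the correct gallery item for each query.
    Returns the fraction of queries whose ground truth is in the top-k.
    """
    # Indices of the k most similar gallery items per query.
    topk = np.argsort(-sim, axis=1)[:, :k]
    hits = (topk == gt[:, None]).any(axis=1)
    return float(hits.mean())

# Toy usage: 3 queries, 5 gallery items.
sim = np.random.default_rng(0).normal(size=(3, 5))
gt = np.array([0, 3, 4])
print(recall_at_k(sim, gt, k=1), recall_at_k(sim, gt, k=5))
```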
Researcher Affiliation: Industry. "Yuying Ge [1], Sijie Zhao [1], Ziyun Zeng [2], Yixiao Ge [1,2], Chen Li [2], Xintao Wang [1,2], Ying Shan [1,2]. [1] Tencent AI Lab, [2] ARC Lab, Tencent PCG."
Pseudocode: No. The paper uses mathematical equations and diagrams to describe processes but does not include structured pseudocode or algorithm blocks.
Open Source Code: Yes. "The code (training and inference) and models are released in https://github.com/AILab-CVC/SEED."
Open Datasets: Yes. "We pre-train SEED tokenizer on image-text pairs including CC3M (Sharma et al., 2018), Unsplash (Luke Chesser, 2023), LAION-COCO (Christoph et al., 2022) and MS-COCO (Chen et al., 2015). We use a large-scale dataset WebVid-10M (Bain et al., 2021) containing videos and captions. We use publicly available MMC4 (Zhu et al., 2023b) and OBELISC (Laurençon et al., 2023) datasets."
Dataset Splits: Yes. "We evaluate SEED-LLaMA on a wide range of multimodal comprehension tasks including image captioning and image/video question answering. Details of these benchmarks and evaluation metrics are provided in Appendix D. As shown in Tab. 8: VQAv2 (Goyal et al., 2017), Scene Understanding QA, test-dev split, VQA acc. (↑); OKVQA (Marino et al., 2019), External Knowledge QA, val split, VQA acc. (↑); VizWiz (Gurari et al., 2018), Scene Understanding QA, test-dev split, VQA acc. (↑)."
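These benchmarks use the standard VQA accuracy, which scores a predicted answer against the ten human annotations as acc = min(#matching annotations / 3, 1). A minimal sketch of that metric (the official evaluation additionally normalizes punctuation and articles, which is omitted here):

```python
def vqa_accuracy(pred: str, human_answers: list[str]) -> float:
    """Standard VQA accuracy: an answer scores 100% if at least 3 of
    the human annotators gave it, and matches/3 otherwise."""
    matches = sum(a.strip().lower() == pred.strip().lower() for a in human_answers)
    return min(matches / 3.0, 1.0)

print(vqa_accuracy("cat", ["cat"] * 6 + ["dog"] * 4))  # 1.0
print(vqa_accuracy("dog", ["cat"] * 9 + ["dog"]))      # ~0.33
```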
Hardware Specification: Yes. "We perform pretraining using two versions of LLM, Vicuna-7B and Llama2-chat-13B, with 64 A100-40G GPUs, and yield SEED-LLaMA-8B (144 hours) and SEED-LLaMA-14B (216 hours), respectively. ... The overall instruction tuning phase takes 16 hours for SEED-LLaMA-8B and 27 hours for SEED-LLaMA-14B with 32 A100-80G GPUs."
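For cost estimation, the reported settings translate into total GPU-hours as follows (a back-of-the-envelope calculation from the quoted figures, not a number stated in the paper):

```python
# Pretraining: 64 x A100-40G.
pretrain_8b  = 64 * 144   # 9,216 GPU-hours for SEED-LLaMA-8B
pretrain_14b = 64 * 216   # 13,824 GPU-hours for SEED-LLaMA-14B
# Instruction tuning: 32 x A100-80G.
tune_8b  = 32 * 16        # 512 GPU-hours
tune_14b = 32 * 27        # 864 GPU-hours
print(pretrain_8b + tune_8b, pretrain_14b + tune_14b)  # 9728, 14688
```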
Software Dependencies No The paper does not provide specific version numbers for software dependencies or libraries used in the experiments.
Experiment Setup: Yes. "Table 5: Summary of pretraining hyperparameters of SEED-LLaMA." Configuration (SEED 8B / SEED 14B): peak learning rate 1.5e-4; warmup ratio 0.03; LR schedule cosine decay; optimizer AdamW (β1, β2, ϵ = 0.9, 0.98, 1e-6); image resolution 224×224; weight decay 0.05; iterations 30k + 10k.
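The listed schedule (peak LR 1.5e-4, warmup ratio 0.03, cosine decay) corresponds to the common linear-warmup-then-cosine curve. A minimal sketch under those assumptions; the floor learning rate and exact step accounting in the released code may differ:

```python
import math

def lr_at_step(step: int, total_steps: int, peak_lr: float = 1.5e-4,
               warmup_ratio: float = 0.03, min_lr: float = 0.0) -> float:
    """Linear warmup for warmup_ratio * total_steps, then cosine decay to min_lr."""
    warmup_steps = int(warmup_ratio * total_steps)
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

total = 30_000  # first pretraining stage from Table 5
for s in (0, 900, 15_000, 29_999):
    print(s, f"{lr_at_step(s, total):.2e}")
```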