SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen LLMs

Authors: Lijun Yu, Yong Cheng, Zhiruo Wang, Vivek Kumar, Wolfgang Macherey, Yanping Huang, David Ross, Irfan Essa, Yonatan Bisk, Ming-Hsuan Yang, Kevin P. Murphy, Alexander Hauptmann, Lu Jiang

NeurIPS 2023

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Our approach is validated through in-context learning experiments with frozen PaLM 2 and GPT-3.5 on a diverse set of image understanding and generation tasks. |
| Researcher Affiliation | Collaboration | Google, Carnegie Mellon University |
| Pseudocode | No | No structured pseudocode or algorithm blocks with explicit labels such as "Pseudocode" or "Algorithm X" are found in the paper. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing the source code or a direct link to a code repository for the described methodology. |
| Open Datasets | Yes | Following the prior work [27], SPAE is trained on the ImageNet ILSVRC2012 [10] dataset. |
| Dataset Splits | Yes | We use FID [16], Inception Score (IS) [33], and LPIPS [48] to compare with the image VQGAN from MaskGIT [7] on the ImageNet validation set, and FVD [36] to compare with the 3D-VQGAN from MAGVIT [45] on the Kinetics-600 validation set. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory amounts used for running the experiments. |
| Software Dependencies | No | The paper mentions using the "Adam [20] optimizer" and "CLIP with a ViT-L/14 [13] vision backbone" but does not specify version numbers for these or other software libraries or frameworks. |
| Experiment Setup | Yes | We train with a batch size of 256 for 450k steps. ... We use the Adam [20] optimizer with loss weights α = 1, β = 0.33, λ = 0.1, η = 0.1, φ = 10⁴ and a learning rate of 10⁻⁴ following a linear warmup/cooldown and square-root decay schedule. |
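The learning-rate schedule quoted in the Experiment Setup row (linear warmup/cooldown around a square-root decay, peak 10⁻⁴, 450k total steps) can be sketched as follows. This is a minimal illustrative reconstruction, not the authors' code: the warmup and cooldown step counts are assumptions, as the excerpt does not specify them.

```python
# Hypothetical sketch of the schedule described in the excerpt:
# linear warmup to a peak learning rate of 1e-4, square-root decay
# afterwards, and a linear cooldown over the final steps.
PEAK_LR = 1e-4
TOTAL_STEPS = 450_000
WARMUP_STEPS = 10_000    # assumed; not stated in the excerpt
COOLDOWN_STEPS = 10_000  # assumed; not stated in the excerpt

def learning_rate(step: int) -> float:
    """Return the learning rate at a given training step."""
    if step < WARMUP_STEPS:
        # Linear warmup from 0 to the peak rate.
        return PEAK_LR * step / WARMUP_STEPS
    # Square-root decay relative to the end of warmup.
    lr = PEAK_LR * (WARMUP_STEPS / step) ** 0.5
    cooldown_start = TOTAL_STEPS - COOLDOWN_STEPS
    if step >= cooldown_start:
        # Linear cooldown to 0 over the final steps.
        lr *= (TOTAL_STEPS - step) / COOLDOWN_STEPS
    return lr
```

For example, the rate rises linearly to 10⁻⁴ at the end of warmup, falls proportionally to 1/√step afterwards, and is ramped down to zero over the final steps.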