SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen LLMs
Authors: Lijun Yu, Yong Cheng, Zhiruo Wang, Vivek Kumar, Wolfgang Macherey, Yanping Huang, David Ross, Irfan Essa, Yonatan Bisk, Ming-Hsuan Yang, Kevin P. Murphy, Alexander Hauptmann, Lu Jiang
NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our approach is validated through in-context learning experiments with frozen PaLM 2 and GPT 3.5 on a diverse set of image understanding and generation tasks. |
| Researcher Affiliation | Collaboration | Google, Carnegie Mellon University |
| Pseudocode | No | No structured pseudocode or algorithm blocks with explicit labels like 'Pseudocode' or 'Algorithm X' are found in the paper. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing the source code or a direct link to a code repository for the described methodology. |
| Open Datasets | Yes | Following the prior work [27], SPAE is trained on the ImageNet ILSVRC2012 [10] dataset. |
| Dataset Splits | Yes | We use FID [16], Inception Score (IS) [33], and LPIPS [48] to compare with the image VQGAN from MaskGIT [7] on the ImageNet validation set, and FVD [36] to compare the 3D-VQGAN from MAGVIT [45] on the Kinetics-600 validation set. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory amounts used for running the experiments. |
| Software Dependencies | No | The paper mentions using the 'Adam [20] optimizer' and 'CLIP with a ViT-L/14 [13] vision backbone' but does not specify version numbers for these or other software libraries or frameworks. |
| Experiment Setup | Yes | We train with a batch size of 256 for 450k steps. ... We use the Adam [20] optimizer with loss weights α = 1, β = 0.33, λ = 0.1, η = 0.1, φ = 10⁴ and a learning rate of 10⁻⁴ following a linear warmup/cooldown and root square decay schedule. (A hedged sketch of this schedule follows the table.) |
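Since the paper releases no code, the quoted training setup can only be approximated. The sketch below is a minimal, hypothetical Python rendering of those hyperparameters, assuming "root square decay" means the usual inverse-square-root decay and that the warmup/cooldown lengths (which the paper does not state) are placeholders; all names are our own, not the authors'.

```python
# Hypothetical reconstruction of the reported training hyperparameters:
# batch size 256, 450k steps, Adam, peak LR 1e-4, linear warmup/cooldown
# with inverse-square-root decay in between. Loss weights as quoted:
# alpha=1, beta=0.33, lambda=0.1, eta=0.1, phi=1e4.
BATCH_SIZE = 256
TOTAL_STEPS = 450_000
PEAK_LR = 1e-4
WARMUP_STEPS = 10_000    # assumed; not stated in the paper
COOLDOWN_STEPS = 10_000  # assumed; not stated in the paper

LOSS_WEIGHTS = dict(alpha=1.0, beta=0.33, lam=0.1, eta=0.1, phi=1e4)


def learning_rate(step: int) -> float:
    """Linear warmup, inverse-square-root decay, then linear cooldown to 0."""
    if step < WARMUP_STEPS:
        # Linear warmup from 0 to the peak learning rate.
        return PEAK_LR * step / WARMUP_STEPS

    def decayed(s: int) -> float:
        # Inverse-square-root decay anchored at the end of warmup.
        return PEAK_LR * (WARMUP_STEPS / max(s, WARMUP_STEPS)) ** 0.5

    cooldown_start = TOTAL_STEPS - COOLDOWN_STEPS
    if step <= cooldown_start:
        return decayed(step)
    # Linear cooldown from the decayed value at cooldown_start down to 0.
    frac = (TOTAL_STEPS - step) / COOLDOWN_STEPS
    return decayed(cooldown_start) * max(frac, 0.0)
```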