Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction
Authors: Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, Liwei Wang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On the ImageNet 256×256 benchmark, VAR significantly improves the AR baseline, improving Fréchet inception distance (FID) from 18.65 to 1.73 and inception score (IS) from 80.4 to 350.2, with 20× faster inference speed. |
| Researcher Affiliation | Collaboration | Keyu Tian¹,², Yi Jiang², Zehuan Yuan², Bingyue Peng², Liwei Wang¹,³ (¹Center for Data Science, Peking University; ²Bytedance Inc.; ³State Key Lab of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University) |
| Pseudocode | Yes | Algorithm 1: Multi-scale VQVAE Encoding (see the sketch after this table) |
| Open Source Code | Yes | Codes and models: https://github.com/FoundationVision/VAR |
| Open Datasets | Yes | We trained models across 12 different sizes, from 18M to 2B parameters, on the ImageNet training set [24] containing 1.28M images |
| Dataset Splits | Yes | We assessed the final test cross-entropy loss L and token prediction error rates Err on the ImageNet validation set of 50,000 images [24]. |
| Hardware Specification | No | The paper mentions training compute in PFlops, but does not specify the exact hardware (e.g., GPU/CPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions using a 'GPT-2-like transformer architecture' and 'AdamW optimizer' but does not specify version numbers for any software libraries or dependencies (e.g., PyTorch version, CUDA version). |
| Experiment Setup | Yes | All models are trained with similar settings: a base learning rate of 10⁻⁴ per 256 batch size, an AdamW optimizer with β1 = 0.9, β2 = 0.95, weight decay = 0.05, a batch size from 768 to 1024, and training epochs from 200 to 350 (depending on model size); see the configuration sketch after this table. |
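The paper's Algorithm 1 quantizes an encoder feature map into token maps at increasing resolutions, subtracting each scale's reconstruction from a running residual. Below is a minimal PyTorch sketch of that loop; the encoder output `f`, the codebook, and the per-scale refinement convolutions `phis` are assumed inputs, and `multiscale_encode` is a hypothetical helper name, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def multiscale_encode(f, codebook, scales, phis):
    """Quantize a feature map f of shape (1, C, H, W) into K token maps.

    codebook: (V, C) code vectors; scales: [(h_1, w_1), ..., (h_K, w_K)];
    phis: K shape-preserving conv layers, one per scale (assumed components).
    """
    H, W = f.shape[-2:]
    tokens = []
    for k, (h, w) in enumerate(scales):
        # Downsample the current residual to the k-th resolution.
        fk = F.interpolate(f, size=(h, w), mode="area")
        # Nearest-codebook quantization: index of the closest code vector.
        flat = fk.permute(0, 2, 3, 1).reshape(-1, fk.shape[1])
        idx = torch.cdist(flat, codebook).argmin(dim=1)
        tokens.append(idx.view(1, h, w))
        # Look up the code vectors, upsample to full resolution, and remove
        # this scale's contribution from the residual before the next scale.
        zk = codebook[idx].view(1, h, w, -1).permute(0, 3, 1, 2)
        zk = F.interpolate(zk, size=(H, W), mode="bicubic")
        f = f - phis[k](zk)
    return tokens  # coarse-to-fine multi-scale token maps (r_1, ..., r_K)
```

A quick way to exercise it is `phis = [torch.nn.Conv2d(C, C, 3, padding=1) for _ in scales]` with the final scale equal to `(H, W)`, so the last upsampling is a no-op.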
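The quoted experiment setup translates directly into an optimizer configuration. The sketch below assumes PyTorch; the model is a stand-in for the VAR transformer, and the learning rate is scaled linearly from the reported base of 10⁻⁴ per 256 batch size.

```python
import torch

batch_size = 1024                 # the paper reports 768 to 1024
lr = 1e-4 * batch_size / 256      # base LR of 1e-4, scaled per 256 batch size

model = torch.nn.Linear(10, 10)   # placeholder for the actual transformer
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=lr,
    betas=(0.9, 0.95),            # β1 = 0.9, β2 = 0.95 as quoted
    weight_decay=0.05,
)
```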