Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction
Authors: Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, Liwei Wang
NeurIPS 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On the ImageNet 256×256 benchmark, VAR significantly improves the AR baseline, improving Fréchet inception distance (FID) from 18.65 to 1.73 and inception score (IS) from 80.4 to 350.2, with 20× faster inference speed. |
| Researcher Affiliation | Collaboration | Keyu Tian (1,2), Yi Jiang (2), Zehuan Yuan (2), Bingyue Peng (2), Liwei Wang (1,3) — (1) Center for Data Science, Peking University; (2) Bytedance Inc.; (3) State Key Lab of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University |
| Pseudocode | Yes | Algorithm 1: Multi-scale VQVAE Encoding |
| Open Source Code | Yes | Codes and models: https://github.com/FoundationVision/VAR |
| Open Datasets | Yes | We trained models across 12 different sizes, from 18M to 2B parameters, on the ImageNet training set [24] containing 1.28M images. |
| Dataset Splits | Yes | We assessed the final test cross-entropy loss L and token prediction error rates Err on the ImageNet validation set of 50,000 images [24]. |
| Hardware Specification | No | The paper mentions training compute in PFlops, but does not specify the exact hardware (e.g., GPU/CPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions using a 'GPT-2-like transformer architecture' and 'AdamW optimizer' but does not specify version numbers for any software libraries or dependencies (e.g., PyTorch version, CUDA version). |
| Experiment Setup | Yes | All models are trained with similar settings: a base learning rate of 1e-4 per 256 batch size, an AdamW optimizer with β1 = 0.9, β2 = 0.95, weight decay = 0.05, a batch size from 768 to 1024, and training epochs from 200 to 350 (depending on model size). |
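The Experiment Setup row implies a linear learning-rate scaling rule (base rate of 1e-4 per 256 batch size). A minimal sketch of that rule, assuming standard linear scaling; the function name and config dict are illustrative, not taken from the paper's released code:

```python
# Hedged sketch of the linear LR scaling rule quoted in the table:
# base learning rate of 1e-4 per 256 batch size.

def scaled_lr(batch_size: int, base_lr: float = 1e-4, base_batch: int = 256) -> float:
    """Scale the base learning rate linearly with the actual batch size."""
    return base_lr * batch_size / base_batch

# AdamW hyperparameters quoted from the Experiment Setup row.
adamw_config = {"betas": (0.9, 0.95), "weight_decay": 0.05}

# Effective learning rates at the reported batch-size extremes.
print(scaled_lr(768))   # 3e-4 at batch size 768
print(scaled_lr(1024))  # 4e-4 at batch size 1024
```

Under this reading, the reported batch sizes of 768 and 1024 would correspond to effective learning rates of 3e-4 and 4e-4, respectively.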