Image Understanding Makes for A Good Tokenizer for Image Generation
Authors: Luting Wang, Yang Zhao, Zijian Zhang, Jiashi Feng, Si Liu, Bingyi Kang
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our study involves training three components within the AR framework: tokenizer, decoder, and proposal network. ... The models are then evaluated using various metrics, including codebook usage, Fréchet Inception Distance (FID) [14], Inception Score (IS) [34], perplexity (PPL), etc. ... We evaluate the class-conditional IG performance of VQGAN, FSQ, and VQ-KD tokenizers on IN-1k. The results in Tab. 1 lead to the following observations. |
| Researcher Affiliation | Collaboration | Luting Wang, Yang Zhao¹, Zijian Zhang¹, Jiashi Feng¹, Si Liu, Bingyi Kang¹. ¹ByteDance. *Work done during an internship at ByteDance. Email: wangluting@buaa.edu.cn. Corresponding authors: liusi@buaa.edu.cn, bingyikang@bytedance.com. Equal contribution. Project lead. |
| Pseudocode | No | The paper describes methods in text and figures, but does not include any explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code is released at https://github.com/magic-research/vector_quantization. |
| Open Datasets | Yes | The experiments are conducted on two image datasets: ImageNet-1k (IN-1k) [7] and MS-COCO [23]. |
| Dataset Splits | Yes | The IN-1k dataset contains approximately 1.28 million training images and 50,000 validation images across 1,000 diverse categories. The MS-COCO dataset comprises 82,783 images for training and 40,504 for validation. |
| Hardware Specification | Yes | Experiments are performed using 8 A100 80GB GPUs. |
| Software Dependencies | No | The paper mentions 'an AdamW optimizer is utilized' and 'decoders utilize the Adam [18] optimizer', but does not specify versions for software libraries like PyTorch, TensorFlow, or Python. |
| Experiment Setup | Yes | The learning rate warms up linearly to 10⁻⁴ for 25,000 steps, subsequently decaying to 10⁻⁵ under a cosine schedule. Unless specifically stated, the VQ-KD tokenizer is trained with an input size of 224×224 and a codebook dimension of 32. Both D and PAR training span 260,000 steps with a collective batch size of 96 for IN-1k and 24 for MS-COCO. The decoders utilize the Adam [18] optimizer with learning rates set at 5.4×10⁻⁵, β1 = 0.5, and β2 = 0.9. (A hedged training-configuration sketch follows the table.) |
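
For readers reconstructing the setup, below is a minimal PyTorch sketch of the optimizer and learning-rate schedule quoted in the Experiment Setup row. The `tokenizer` and `decoder` modules and the tokenizer's total step count are placeholders/assumptions; only the warmup length, peak/floor learning rates, and the decoder's Adam settings come from the paper's text.

```python
import math
import torch

# Placeholder modules; the actual VQ-KD tokenizer and decoder architectures
# are described in the paper, not reproduced here.
tokenizer = torch.nn.Linear(32, 32)
decoder = torch.nn.Linear(32, 32)

# Tokenizer schedule quoted above: AdamW, linear warmup to 1e-4 over 25,000 steps,
# then cosine decay toward 1e-5. TOTAL_STEPS is an assumption for illustration;
# the quoted 260,000 steps refers to decoder/PAR training.
PEAK_LR, FLOOR_LR = 1e-4, 1e-5
WARMUP_STEPS, TOTAL_STEPS = 25_000, 260_000

tok_opt = torch.optim.AdamW(tokenizer.parameters(), lr=PEAK_LR)

def lr_factor(step: int) -> float:
    """Multiplicative factor relative to PEAK_LR, for use with LambdaLR."""
    if step < WARMUP_STEPS:
        return step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(TOTAL_STEPS - WARMUP_STEPS, 1)
    cosine = 0.5 * (1 + math.cos(math.pi * min(progress, 1.0)))
    return (FLOOR_LR + (PEAK_LR - FLOOR_LR) * cosine) / PEAK_LR

tok_sched = torch.optim.lr_scheduler.LambdaLR(tok_opt, lr_factor)

# Decoder optimizer quoted above: Adam with lr = 5.4e-5, beta1 = 0.5, beta2 = 0.9.
dec_opt = torch.optim.Adam(decoder.parameters(), lr=5.4e-5, betas=(0.5, 0.9))

# Typical step (training loop, data loading, and losses omitted):
#   loss.backward(); tok_opt.step(); tok_sched.step(); tok_opt.zero_grad()
```

The warmup-then-cosine factor is expressed through `LambdaLR` so a single callable covers both phases; any equivalent scheduler composition would reproduce the same quoted hyper-parameters.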