Image Understanding Makes for A Good Tokenizer for Image Generation

Authors: Luting Wang, Yang Zhao, Zijian Zhang, Jiashi Feng, Si Liu, Bingyi Kang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our study involves training three components within the AR framework: tokenizer, decoder, and proposal network. ... The models are then evaluated using various metrics, including codebook usage, Fréchet Inception Distance (FID) [14], Inception Score (IS) [34], perplexity (PPL), etc. ... We evaluate the class-conditional IG performance of VQGAN, FSQ, and VQ-KD tokenizers on IN-1k. The results in Tab. 1 lead to the following observations. (A codebook-usage sketch appears after the table.)
Researcher Affiliation | Collaboration | Luting Wang, Yang Zhao¹, Zijian Zhang¹, Jiashi Feng¹, Si Liu, Bingyi Kang¹ (¹ByteDance). *Work done during an internship at ByteDance. Email: wangluting@buaa.edu.cn. Corresponding authors: liusi@buaa.edu.cn, bingyikang@bytedance.com. Equal contribution. Project lead.
Pseudocode | No | The paper describes methods in text and figures, but does not include any explicit pseudocode or algorithm blocks.
Open Source Code | Yes | The code is released at https://github.com/magic-research/vector_quantization.
Open Datasets | Yes | The experiments are conducted on two image datasets: ImageNet-1k (IN-1k) [7] and MS-COCO [23].
Dataset Splits | Yes | The IN-1k dataset contains approximately 1.28 million training images and 50,000 validation images across 1,000 diverse categories. The MS-COCO dataset comprises 82,783 images for training and 40,504 for validation.
Hardware Specification | Yes | Experiments are performed using 8 A100 80GB GPUs.
Software Dependencies | No | The paper mentions 'an AdamW optimizer is utilized' and 'decoders utilize the Adam [18] optimizer', but does not specify versions for software libraries such as PyTorch, TensorFlow, or Python.
Experiment Setup | Yes | The learning rate warms up linearly to 10⁻⁴ over 25,000 steps, subsequently decaying to 10⁻⁵ under a cosine schedule. Unless specifically stated, the VQ-KD tokenizer is trained with an input size of 224×224 and a codebook dimension of 32. Both D and PAR training span 260,000 steps with a collective batch size of 96 for IN-1k and 24 for MS-COCO. The decoders utilize the Adam [18] optimizer with a learning rate of 5.4×10⁻⁵, β1 = 0.5, and β2 = 0.9. (A schedule and optimizer sketch follows the table.)
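
The Experiment Setup row describes a standard warmup-plus-cosine learning-rate schedule and an Adam configuration for the decoder. The following is a minimal PyTorch sketch, not the released training code: the model objects and the total step count used for the cosine decay are placeholders, and only the learning rates, the 25,000-step warmup, and the Adam betas come from the quoted text.

import math
import torch

def warmup_cosine(step, warmup_steps=25_000, total_steps=260_000,
                  peak_lr=1e-4, final_lr=1e-5):
    # Linear warmup to peak_lr, then cosine decay to final_lr.
    # total_steps is an assumption borrowed from the decoder/PAR step count.
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1 + math.cos(math.pi * progress))

tokenizer = torch.nn.Linear(32, 32)   # placeholder module for the VQ-KD tokenizer
decoder = torch.nn.Linear(32, 32)     # placeholder module for the decoder

# Tokenizer: AdamW with the learning rate driven entirely by the schedule
# (base lr of 1.0, so the lambda's return value is the effective lr).
tok_opt = torch.optim.AdamW(tokenizer.parameters(), lr=1.0)
tok_sched = torch.optim.lr_scheduler.LambdaLR(tok_opt, lr_lambda=warmup_cosine)

# Decoder: Adam with lr = 5.4e-5, beta1 = 0.5, beta2 = 0.9, as quoted.
dec_opt = torch.optim.Adam(decoder.parameters(), lr=5.4e-5, betas=(0.5, 0.9))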
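
Of the metrics quoted in the Research Type row, FID, IS, and PPL follow their standard definitions, while codebook usage is simply the fraction of codebook entries selected at least once on the evaluation set. Below is a minimal sketch under that reading; the tokenizer object and its encode_to_indices method are hypothetical stand-ins, not the paper's API.

import torch

@torch.no_grad()
def codebook_usage(tokenizer, data_loader, codebook_size, device="cuda"):
    # Fraction of codebook entries hit at least once over the evaluation set.
    used = torch.zeros(codebook_size, dtype=torch.bool, device=device)
    for images, _ in data_loader:
        # Hypothetical call returning integer code indices, e.g. shape (B, H', W').
        indices = tokenizer.encode_to_indices(images.to(device))
        used[indices.unique()] = True
    return used.float().mean().item()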