An Image is Worth 32 Tokens for Reconstruction and Generation

Authors: Qihang Yu, Mark Weber, Xueqing Deng, Xiaohui Shen, Daniel Cremers, Liang-Chieh Chen

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Building upon TiTok, we conduct extensive experiments to probe the dynamics of 1D image tokenization. Our investigation studies the interplay between latent space size, model size, reconstruction fidelity, and generative quality.
Researcher Affiliation | Collaboration | Qihang Yu¹*, Mark Weber¹,²*, Xueqing Deng¹, Xiaohui Shen¹, Daniel Cremers², Liang-Chieh Chen¹ (¹ByteDance, ²Technical University of Munich)
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. Figure 3 illustrates the framework but is not presented as pseudocode.
Open Source Code | Yes | The code and model are available at https://github.com/bytedance/1d-tokenizer.
Open Datasets | Yes | For image reconstruction (tokenizer) at preliminary experiments, the training augmentation is confined to random cropping and flipping, following [19]. The training regimen spans a short schedule, featuring a batch size of 256 over 500k training iterations, which correlates to roughly 100 epochs on the ImageNet dataset. ...We train and evaluate TiTok on ImageNet-1K generation benchmark. This dataset spans 1000 object classes and contains 1,281,167 training images, 50,000 validation images and 100,000 test images.
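As a point of reference, the quoted augmentation and schedule map onto a standard ImageNet-1K training pipeline. The sketch below is illustrative only: it assumes PyTorch/torchvision, a hypothetical local dataset path, and RandomResizedCrop as the crop policy (the excerpt only says "random cropping").

```python
# Minimal sketch of the quoted tokenizer-training augmentation on ImageNet-1K
# (random cropping + random flipping). The exact crop policy is not given in
# the excerpt; RandomResizedCrop with torchvision defaults is an assumption.
import torchvision.transforms as T
from torchvision.datasets import ImageNet

train_transform = T.Compose([
    T.RandomResizedCrop(256),   # paper trains at 256x256; crop policy assumed
    T.RandomHorizontalFlip(),   # random flipping as quoted
    T.ToTensor(),
])

# Hypothetical local path; point this at your ImageNet-1K copy.
train_set = ImageNet(root="/data/imagenet", split="train", transform=train_transform)

# Sanity check of the quoted schedule: 500k iterations at batch size 256
# over 1,281,167 training images is roughly 100 epochs.
iterations, batch_size, n_train = 500_000, 256, 1_281_167
print(iterations * batch_size / n_train)  # ~99.9 epochs
```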
Dataset Splits | Yes | The validation set is used to compute reconstruction FID for evaluating tokenizers. ...This dataset spans 1000 object classes and contains 1,281,167 training images, 50,000 validation images and 100,000 test images.
Hardware Specification | Yes | The sampling speed (de-tokenization included) is measured with an A100 GPU. ...The tokenizer training takes 64 A100-40G for 74 hours (TiTok-L-32), 32 A100-40G for 41 hours (TiTok-B-64), 32 A100-40G for 50 hours (TiTok-S-128), 32 A100-40G for 70 hours (TiTok-B-128 for resolution 512), and 64 A100-40G for 91 hours (TiTok-L-64 for resolution 512), respectively.
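Converting the quoted wall-clock figures into GPU-hours makes the compute budget easier to compare across configurations; the snippet below is arithmetic on the reported numbers only.

```python
# GPU-hour budgets implied by the quoted tokenizer-training times
# (A100-40G count x wall-clock hours); arithmetic on reported values only.
runs = {
    "TiTok-L-32":            (64, 74),
    "TiTok-B-64":            (32, 41),
    "TiTok-S-128":           (32, 50),
    "TiTok-B-128 (res 512)": (32, 70),
    "TiTok-L-64 (res 512)":  (64, 91),
}
for name, (gpus, hours) in runs.items():
    print(f"{name}: {gpus * hours:,} A100-40G GPU-hours")
# TiTok-L-32: 4,736 / TiTok-B-64: 1,312 / TiTok-S-128: 1,600
# TiTok-B-128 (res 512): 2,240 / TiTok-L-64 (res 512): 5,824
```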
Software Dependencies | No | The paper mentions optimizers (AdamW) and frameworks (MaskGIT) but does not provide specific version numbers for software components or libraries (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | Unless specified otherwise, we train all models with images of resolution H = 256 and W = 256, using the open-source MaskGIT-VQGAN [9] to supply proxy codes for training. The patch size for both tokenizer and de-tokenizer is established with f = 16, and the codebook C is configured to have N = 1024 entries with each entry a vector with 16 channels. ...In the final setting for TiTok training, the codebook is configured to N = 4096, and the training duration is extended to 1M iterations (200 epochs). ...The generative models are trained with a batch size of 2048 and 500k iterations to improve training efficiency. We use AdamW optimizer [46] with learning rate 2 × 10⁻⁴ and weight decay 0.03.
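The quoted generator-training hyperparameters translate directly into a standard PyTorch optimizer setup. The sketch below assumes PyTorch and uses a stand-in module in place of the actual masked generative transformer, which the excerpt does not specify in code form.

```python
# Sketch of the quoted generator-training optimizer configuration
# (AdamW, learning rate 2e-4, weight decay 0.03); PyTorch is assumed.
import torch

model = torch.nn.Linear(16, 16)  # stand-in for the masked generative transformer

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=2e-4,           # learning rate 2 x 10^-4 as quoted
    weight_decay=0.03,
)

# Context for the tokenizer geometry: at 256x256 input with patch size f = 16,
# a conventional 2D tokenizer produces (256 // 16) ** 2 = 256 tokens, the grid
# that TiTok's 1D latent sequence (as few as 32 tokens) compresses.
```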