OmniTokenizer: A Joint Image-Video Tokenizer for Visual Generation

Authors: Junke Wang, Yi Jiang, Zehuan Yuan, Bingyue Peng, Zuxuan Wu, Yu-Gang Jiang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate that OmniTokenizer achieves state-of-the-art (SOTA) reconstruction performance on various image and video datasets, e.g., 1.11 reconstruction FID on ImageNet and 42 reconstruction FVD on UCF-101, beating the previous SOTA methods by 13% and 26%, respectively. Additionally, we also show that when integrated with OmniTokenizer, both language model-based approaches and diffusion models can realize advanced visual synthesis performance, underscoring the superiority and versatility of our method.
Researcher Affiliation | Collaboration | (1) Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University; (2) Shanghai Collaborative Innovation Center on Intelligent Visual Computing; (3) ByteDance Inc.
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code available at https://github.com/FoundationVision/OmniTokenizer.
Open Datasets | Yes | We evaluate the visual tokenization performance of OmniTokenizer on both image and video datasets, including ImageNet [9], CelebA-HQ [21], FFHQ [22], Kinetics [23, 6], UCF-101 [46], Moments-in-Time (MiT) [31], and Something-Something v2 (SSV2) [15].
Dataset Splits | Yes | Reconstruction FID on ImageNet validation split... (Table 1). During the image training stage, we train the model with a fixed image resolution of 256×256. For the joint training stage, we forward the model with image and video data iteratively, with the video sequence length being 17 frames. The spatial resolutions are randomly chosen from 128, 192, 256, 320, and 384 [49].
Hardware Specification | Yes | We train our model using 8 NVIDIA A100 GPUs for 2 weeks.
Software Dependencies | No | The paper mentions "Adam [24] is employed for optimization (β1 = 0.9 and β2 = 0.99)" but does not specify version numbers for Python, PyTorch, or CUDA, which are typical software dependencies for such research.
Experiment Setup | Yes | OmniTokenizer adopts a decoupled spatial-temporal architecture consisting of 4 window attention-based spatial layers (window size = 8) and 4 causal attention-based temporal layers. The hidden dimension is 512 and the latent dimension is 8, following ViT-VQGAN [64]. λ1, λ2, and λ3 are set to 1, 1, and 1e-6, respectively. As mentioned in Sec. 3.1.2, the training of OmniTokenizer follows a progressive training strategy, where both stages last 500K iterations. The learning rate is warmed up to 1e-3 and decayed to 0 using a cosine scheduler. Adam [24] is employed for optimization (β1 = 0.9 and β2 = 0.99).
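
The quoted experiment setup gives enough detail to sketch the optimization schedule. Below is a minimal sketch, assuming a standard linear-warmup-then-cosine-decay schedule and PyTorch's Adam; the warmup length, function names (e.g., warmup_cosine_lr), and constants are illustrative assumptions, not values taken from the paper or the released code.

```python
import math

# Hyperparameters quoted in the paper's experiment setup.
PEAK_LR = 1e-3            # learning rate warmed up to 1e-3
TOTAL_STEPS = 500_000     # each training stage lasts 500K iterations
ADAM_BETAS = (0.9, 0.99)  # Adam beta1 = 0.9, beta2 = 0.99
LOSS_WEIGHTS = {"lambda1": 1.0, "lambda2": 1.0, "lambda3": 1e-6}

# Assumption: the paper does not state the warmup length; this value is illustrative.
WARMUP_STEPS = 10_000


def warmup_cosine_lr(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to 0 over TOTAL_STEPS."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * min(progress, 1.0)))


# Usage with a hypothetical PyTorch model (uncomment if torch is available):
# import torch
# optimizer = torch.optim.Adam(model.parameters(), lr=PEAK_LR, betas=ADAM_BETAS)
# scheduler = torch.optim.lr_scheduler.LambdaLR(
#     optimizer, lambda step: warmup_cosine_lr(step) / PEAK_LR)
```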