OmniTokenizer: A Joint Image-Video Tokenizer for Visual Generation

Authors: Junke Wang, Yi Jiang, Zehuan Yuan, Bingyue Peng, Zuxuan Wu, Yu-Gang Jiang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate that OmniTokenizer achieves state-of-the-art (SOTA) reconstruction performance on various image and video datasets, e.g., 1.11 reconstruction FID on ImageNet and 42 reconstruction FVD on UCF-101, beating the previous SOTA methods by 13% and 26%, respectively. Additionally, we also show that when integrated with OmniTokenizer, both language model-based approaches and diffusion models can realize advanced visual synthesis performance, underscoring the superiority and versatility of our method.
Researcher Affiliation | Collaboration | (1) Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University; (2) Shanghai Collaborative Innovation Center on Intelligent Visual Computing; (3) ByteDance Inc.
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code available at https://github.com/FoundationVision/OmniTokenizer.
Open Datasets | Yes | We evaluate the visual tokenization performance of OmniTokenizer on both image and video datasets, including ImageNet [9], CelebA-HQ [21], FFHQ [22], Kinetics [23, 6], UCF-101 [46], Moments-in-Time (MiT) [31], and Something-Something v2 (SSV2) [15].
Dataset Splits | Yes | Reconstruction FID on ImageNet validation split... (Table 1). During the image training stage, we train the model with a fixed image resolution of 256×256. For the joint training stage, we forward the model with image and video data iteratively, with the video sequence length being 17 frames. The spatial resolutions are randomly chosen from 128, 192, 256, 320, and 384 [49].
Hardware Specification | Yes | We train our model using 8 NVIDIA A100 GPUs for 2 weeks.
Software Dependencies | No | The paper mentions "Adam [24] is employed for optimization (β1 = 0.9 and β2 = 0.99)" but does not specify version numbers for Python, PyTorch, or CUDA, which are typical software dependencies for such research.
Experiment Setup | Yes | OmniTokenizer adopts a decoupled spatial-temporal architecture consisting of 4 window attention-based spatial layers (window size = 8) and 4 causal attention-based temporal layers. The hidden dimension is 512 and the latent dimension is 8, following ViT-VQGAN [64]. λ1, λ2, and λ3 are set to 1, 1, and 1e-6, respectively. As mentioned in Sec. 3.1.2, the training of OmniTokenizer follows a progressive training strategy, where both stages last 500K iterations. The learning rate is warmed up to 1e-3 and decayed to 0 using a cosine scheduler. Adam [24] is employed for optimization (β1 = 0.9 and β2 = 0.99).
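
The quoted experiment setup gives enough detail to sketch the optimization schedule. Below is a minimal sketch, assuming a standard linear-warmup-then-cosine-decay schedule and PyTorch's Adam; the warmup length, function names (e.g., warmup_cosine_lr), and constants are illustrative assumptions, not values taken from the paper or the released code.

```python
import math

# Hyperparameters quoted in the paper's experiment setup.
PEAK_LR = 1e-3            # learning rate warmed up to 1e-3
TOTAL_STEPS = 500_000     # each training stage lasts 500K iterations
ADAM_BETAS = (0.9, 0.99)  # Adam beta1 = 0.9, beta2 = 0.99
LOSS_WEIGHTS = {"lambda1": 1.0, "lambda2": 1.0, "lambda3": 1e-6}

# Assumption: the paper does not state the warmup length; this value is illustrative.
WARMUP_STEPS = 10_000


def warmup_cosine_lr(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to 0 over TOTAL_STEPS."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * min(progress, 1.0)))


# Usage with a hypothetical PyTorch model (uncomment if torch is available):
# import torch
# optimizer = torch.optim.Adam(model.parameters(), lr=PEAK_LR, betas=ADAM_BETAS)
# scheduler = torch.optim.lr_scheduler.LambdaLR(
#     optimizer, lambda step: warmup_cosine_lr(step) / PEAK_LR)
```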