Language Model Beats Diffusion - Tokenizer is Key to Visual Generation
Authors: Lijun Yu, Jose Lezama, Nitesh Bharadwaj Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Agrim Gupta, Xiuye Gu, Alexander G Hauptmann, Boqing Gong, Ming-Hsuan Yang, Irfan Essa, David A Ross, Lu Jiang
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This section empirically verifies the proposed tokenizer across three distinct tasks: video and image generation, video compression, and action recognition. |
| Researcher Affiliation | Collaboration | Google, Carnegie Mellon University |
| Pseudocode | No | The paper does not contain any sections explicitly labeled 'Pseudocode' or 'Algorithm', nor are there structured steps formatted as pseudocode. |
| Open Source Code | No | The paper does not explicitly state that the source code for the described methodology is open-source or provide a direct link to a code repository. A link to 'qualitative samples' is provided, but not for code. |
| Open Datasets | Yes | Datasets. We use Kinetics-600 (K600) (Carreira et al., 2018) and UCF-101 (Soomro et al., 2012) for video generation experiments, along with ImageNet (Deng et al., 2009) for image generation. In addition, MCL-JCV (Wang et al., 2016) is used as the testbed for video compression, with Kinetics-400 (K400) (Kay et al., 2017) and SSv2 (Goyal et al., 2017) for video understanding. |
| Dataset Splits | No | The paper mentions training and test sets but does not explicitly specify the use of a validation split or its size/proportion. |
| Hardware Specification | No | The paper mentions 'GPU/TPU optimization' generally and that results were obtained 'with TPUs' but does not specify any particular models or configurations of GPUs, CPUs, or TPUs used for the experiments. |
| Software Dependencies | No | The paper describes various model architectures and frameworks used (e.g., VQ-VAE, MLM), but it does not specify software dependencies with version numbers (e.g., Python 3.x, PyTorch x.x, TensorFlow x.x). |
| Experiment Setup | Yes | Implementation details: We follow the tokenizer training setting and hyperparameters in (Yu et al., 2023a), unless stated otherwise. LFQ is used, which eliminates the codebook embedding, to increase the default codebook size to K = 2^18. The weight of L_entropy follows an annealing schedule with a 3× higher starting point and linearly decays to a fixed value of 0.1 within 2k steps. ... Video input: 17 frames, frame stride 1, 128 × 128 resolution. Base channels: 128. ... Learning rate: 10^-4. ... Optimizer: Adam with β1 = 0 and β2 = 0.99. EMA model decay rate: 0.999. Batch size: 256. (A configuration sketch summarizing these settings follows the table.) |
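
The hyperparameters quoted in the Experiment Setup row can be gathered into a single configuration object. The sketch below is illustrative only: the names `TokenizerTrainConfig` and `entropy_weight` are hypothetical, since the authors' code is not released; only the numeric values come from the paper's text.

```python
from dataclasses import dataclass


@dataclass
class TokenizerTrainConfig:
    """Hypothetical container for the tokenizer training settings quoted above."""
    codebook_size: int = 2 ** 18            # LFQ removes the codebook embedding, enabling K = 2^18
    num_frames: int = 17                    # video input: 17 frames
    frame_stride: int = 1
    resolution: int = 128                   # 128 x 128 frames
    base_channels: int = 128
    learning_rate: float = 1e-4
    adam_beta1: float = 0.0                 # Adam with beta1 = 0
    adam_beta2: float = 0.99                # and beta2 = 0.99
    ema_decay: float = 0.999
    batch_size: int = 256
    entropy_weight_final: float = 0.1       # fixed value reached after annealing
    entropy_weight_start_mult: float = 3.0  # "3x higher starting point"
    entropy_anneal_steps: int = 2_000       # decays to the fixed value within 2k steps


def entropy_weight(step: int, cfg: TokenizerTrainConfig) -> float:
    """Linearly decay the entropy-loss weight from 3x the final value down to the final value."""
    start = cfg.entropy_weight_start_mult * cfg.entropy_weight_final
    if step >= cfg.entropy_anneal_steps:
        return cfg.entropy_weight_final
    frac = step / cfg.entropy_anneal_steps
    return start + frac * (cfg.entropy_weight_final - start)
```

With these defaults, `entropy_weight(0, cfg)` returns approximately 0.3 (the 3× starting point) and `entropy_weight(2000, cfg)` returns 0.1, matching the quoted annealing schedule.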