Language Model Beats Diffusion - Tokenizer is Key to Visual Generation
Authors: Lijun Yu, Jose Lezama, Nitesh Bharadwaj Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Agrim Gupta, Xiuye Gu, Alexander G Hauptmann, Boqing Gong, Ming-Hsuan Yang, Irfan Essa, David A Ross, Lu Jiang
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This section empirically verifies the proposed tokenizer across three distinct tasks: video and image generation, video compression, and action recognition. |
| Researcher Affiliation | Collaboration | Google, Carnegie Mellon University |
| Pseudocode | No | The paper does not contain any sections explicitly labeled 'Pseudocode' or 'Algorithm', nor are there structured steps formatted as pseudocode. |
| Open Source Code | No | The paper does not explicitly state that the source code for the described methodology is open-source or provide a direct link to a code repository. A link to 'qualitative samples' is provided, but not for code. |
| Open Datasets | Yes | Datasets. We use Kinetics-600 (K600) (Carreira et al., 2018) and UCF-101 (Soomro et al., 2012) for video generation experiments, along with ImageNet (Deng et al., 2009) for image generation. In addition, MCL-JCV (Wang et al., 2016) is used as the testbed for video compression, with Kinetics-400 (K400) (Kay et al., 2017) and SSv2 (Goyal et al., 2017) for video understanding. |
| Dataset Splits | No | The paper mentions training and test sets but does not explicitly specify the use of a validation split or its size/proportion. |
| Hardware Specification | No | The paper mentions 'GPU/TPU optimization' generally and that results were obtained 'with TPUs' but does not specify any particular models or configurations of GPUs, CPUs, or TPUs used for the experiments. |
| Software Dependencies | No | The paper describes various model architectures and frameworks used (e.g., VQ-VAE, MLM), but it does not specify software dependencies with version numbers (e.g., Python 3.x, PyTorch x.x, TensorFlow x.x). |
| Experiment Setup | Yes | Implementation details: We follow the tokenizer training setting and hyperparameters in (Yu et al., 2023a), unless stated otherwise. LFQ is used, which eliminates the codebook embedding, to increase the default codebook size to K = 2^18. The weight of L_entropy follows an annealing schedule with a 3× higher starting point and linearly decays to a fixed value of 0.1 within 2k steps. ... Video input: 17 frames, frame stride 1, 128 × 128 resolution. Base channels: 128. ... Learning rate: 10^-4. ... Optimizer: Adam with β1 = 0 and β2 = 0.99. EMA model decay rate: 0.999. Batch size: 256. (A configuration sketch summarizing these settings follows the table.) |
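
The hyperparameters quoted in the Experiment Setup row can be gathered into a single configuration object. The sketch below is illustrative only: the names `TokenizerTrainConfig` and `entropy_weight` are hypothetical, since the authors' code is not released; only the numeric values come from the paper's text.

```python
from dataclasses import dataclass


@dataclass
class TokenizerTrainConfig:
    """Hypothetical container for the tokenizer training settings quoted above."""
    codebook_size: int = 2 ** 18            # LFQ removes the codebook embedding, enabling K = 2^18
    num_frames: int = 17                    # video input: 17 frames
    frame_stride: int = 1
    resolution: int = 128                   # 128 x 128 frames
    base_channels: int = 128
    learning_rate: float = 1e-4
    adam_beta1: float = 0.0                 # Adam with beta1 = 0
    adam_beta2: float = 0.99                # and beta2 = 0.99
    ema_decay: float = 0.999
    batch_size: int = 256
    entropy_weight_final: float = 0.1       # fixed value reached after annealing
    entropy_weight_start_mult: float = 3.0  # "3x higher starting point"
    entropy_anneal_steps: int = 2_000       # decays to the fixed value within 2k steps


def entropy_weight(step: int, cfg: TokenizerTrainConfig) -> float:
    """Linearly decay the entropy-loss weight from 3x the final value down to the final value."""
    start = cfg.entropy_weight_start_mult * cfg.entropy_weight_final
    if step >= cfg.entropy_anneal_steps:
        return cfg.entropy_weight_final
    frac = step / cfg.entropy_anneal_steps
    return start + frac * (cfg.entropy_weight_final - start)
```

With these defaults, `entropy_weight(0, cfg)` returns approximately 0.3 (the 3× starting point) and `entropy_weight(2000, cfg)` returns 0.1, matching the quoted annealing schedule.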