FiT: Flexible Vision Transformer for Diffusion Model
Authors: Zeyu Lu, Zidong Wang, Di Huang, Chengyue Wu, Xihui Liu, Wanli Ouyang, Lei Bai
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Comprehensive experiments demonstrate the exceptional performance of FiT across a broad range of resolutions. Repository available at https://github.com/whlzy/FiT. |
| Researcher Affiliation | Collaboration | 1Shanghai Artificial Intelligence Laboratory 2Shanghai Jiao Tong University 3Tsinghua University 4Sydney University 5The University of Hong Kong. |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Repository available at https://github.com/whlzy/FiT. |
| Open Datasets | Yes | We train class-conditional latent FiT models under predetermined maximum resolution limitation, HW <= 256^2 (equivalent to token length L <= 256), on the ImageNet (Deng et al., 2009) dataset. See the token-length sketch below the table. |
| Dataset Splits | No | The paper mentions general training settings (learning rate, batch size, EMA, diffusion hyper-parameters) but does not provide specific dataset split information (percentages, counts) for training, validation, or test sets, nor does it reference predefined splits. |
| Hardware Specification | No | The paper mentions 'GPU hardware' as a constraint but does not provide specific details on the GPU models (e.g., NVIDIA A100), CPU models, or cloud computing instance types used for the experiments. |
| Software Dependencies | No | The paper mentions software such as AdamW, TensorFlow, and Stable Diffusion, but it does not specify exact version numbers for these components, which are required for a reproducible dependency description. |
| Experiment Setup | Yes | We use the same training setting as DiT: a constant learning rate of 1e-4 using AdamW (Loshchilov & Hutter, 2017), no weight decay, and a batch size of 256. Following common practice in the generative modeling literature, we adopt an exponential moving average (EMA) of model weights over training with a decay of 0.9999. All results are reported using the EMA model. We retain the same diffusion hyper-parameters as DiT. See the training-setup sketch below the table. |
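The resolution constraint quoted in the Open Datasets row can be made concrete with a small calculation. The sketch below is only an illustration: the VAE downsampling factor of 8 and the patch size of 2 are assumed from the DiT-style latent setup and are not stated in this table, so treat those factors as assumptions rather than confirmed values.

```python
# Minimal sketch (assumptions): token budget for a FiT-style latent diffusion
# transformer. VAE_DOWNSAMPLE=8 and PATCH_SIZE=2 are assumed, not quoted above.

VAE_DOWNSAMPLE = 8   # assumed: pixel -> latent spatial reduction factor
PATCH_SIZE = 2       # assumed: latent pixels per transformer token (per axis)

def token_length(height_px: int, width_px: int) -> int:
    """Number of transformer tokens for an image of the given pixel size."""
    h_tokens = height_px // (VAE_DOWNSAMPLE * PATCH_SIZE)
    w_tokens = width_px // (VAE_DOWNSAMPLE * PATCH_SIZE)
    return h_tokens * w_tokens

# The quoted constraint: H * W <= 256^2 pixels, equivalent to token length L <= 256.
assert token_length(256, 256) == 256   # 16 x 16 tokens at the square maximum
assert token_length(160, 320) == 200   # a non-square resolution under the same budget
```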
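For the Experiment Setup row, the following is a minimal PyTorch sketch of the reported hyper-parameters. Only the values quoted above (AdamW, constant learning rate 1e-4, no weight decay, batch size 256, EMA decay 0.9999) come from the paper; the function names, the `FiT` model placeholder, and the overall structure are hypothetical.

```python
# Minimal sketch (assumptions): optimizer and EMA setup matching the quoted settings.
import copy
import torch

BATCH_SIZE = 256  # global batch size reported in the paper

def build_optimizer_and_ema(model: torch.nn.Module):
    # AdamW with a constant learning rate of 1e-4 and no weight decay (as reported).
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.0)
    # EMA copy of the model; results in the paper are reported with the EMA weights.
    ema_model = copy.deepcopy(model).eval()
    for p in ema_model.parameters():
        p.requires_grad_(False)
    return optimizer, ema_model

@torch.no_grad()
def update_ema(ema_model: torch.nn.Module, model: torch.nn.Module, decay: float = 0.9999):
    # Exponential moving average update, applied after each optimizer step.
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.mul_(decay).add_(p, alpha=1.0 - decay)
```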