ENAT: Rethinking Spatial-temporal Interactions in Token-based Image Synthesis
Authors: Zanlin Ni, Yulin Wang, Renping Zhou, Yizeng Han, Jiayi Guo, Zhiyuan Liu, Yuan Yao, Gao Huang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on ImageNet-256² & 512² and MS-COCO validate the effectiveness of ENAT. |
| Researcher Affiliation | Collaboration | Zanlin Ni¹, Yulin Wang¹, Renping Zhou¹, Yizeng Han¹, Jiayi Guo¹, Zhiyuan Liu¹, Yuan Yao², Gao Huang¹ (¹Tsinghua University, ²National University of Singapore) |
| Pseudocode | No | The paper describes algorithms and processes textually and with diagrams (e.g., Figure 4), but it does not include a formal pseudocode block or algorithm listing. |
| Open Source Code | Yes | Code and pre-trained models will be released at https://github.com/LeapLabTHU/ENAT. |
| Open Datasets | Yes | Experiments on ImageNet-256² & 512² and MS-COCO validate the effectiveness of ENAT. |
| Dataset Splits | Yes | Our evaluation on FID follows the same evaluation protocol as [10, 3, 49]. We adopt the pre-computed dataset statistics from [3] and generate 50k samples for ImageNet (30k for MS-COCO) to compute the statistics for the generated samples... (see the FID sketch after this table) |
| Hardware Specification | Yes | All our experiments are conducted with 8 A100 80G GPUs. |
| Software Dependencies | No | The paper mentions utilizing a pretrained VQGAN [13] but does not specify software versions or library dependencies used for implementation or experiments. |
| Experiment Setup | Yes | For ImageNet 256×256, we use a batch size of 2048 and a learning rate of 4e-4. For ImageNet 512×512, to manage the increased sequence length, we reduce the batch size to 512 and linearly scale down the learning rate to 1e-4. For MS-COCO, we train for 150k steps instead of the 1000k steps used in [3]. For our ablation studies in Sec. 5.2 and explorative experiments in Sec. 4, we train the models for 300k steps instead of the 500k steps used in [3], while keeping the other settings the same as above. (see the configuration sketch after this table) |
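
The "Dataset Splits" row above describes the FID protocol: 50k generated samples for ImageNet (30k for MS-COCO) are compared against pre-computed dataset statistics from [3]. As a minimal, hypothetical illustration (not the authors' evaluation code, which uses the pre-computed reference statistics directly), a folder-to-folder FID computation with the `clean-fid` package might look like this:

```python
# Minimal sketch of a folder-to-folder FID computation with the clean-fid
# package (https://github.com/GaParmar/clean-fid). The folder names are
# placeholders; the ENAT paper instead compares 50k generated ImageNet
# samples (30k for MS-COCO) against pre-computed dataset statistics from [3].
from cleanfid import fid

score = fid.compute_fid("generated_samples/", "reference_images/")
print(f"FID: {score:.2f}")
```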
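
The "Experiment Setup" row reports hyperparameters that follow linear learning-rate scaling: reducing the batch size from 2048 to 512 (a 4× reduction) scales the learning rate from 4e-4 down to 1e-4. The sketch below summarizes these reported settings; the naming and structure are hypothetical, not the authors' configuration files.

```python
# Hypothetical summary of the reported ENAT training settings -- a sketch
# for reference only, not the authors' actual configuration files.
base_batch_size, base_lr = 2048, 4e-4  # ImageNet 256x256 setting


def scaled_lr(batch_size: int) -> float:
    """Linearly scale the learning rate with the batch size."""
    return base_lr * batch_size / base_batch_size


configs = {
    "imagenet_256": {"batch_size": 2048, "lr": 4e-4},
    "imagenet_512": {"batch_size": 512, "lr": scaled_lr(512)},  # = 1e-4
    "ms_coco":      {"train_steps": 150_000},   # vs. 1000k steps in [3]
    "ablations":    {"train_steps": 300_000},   # vs. 500k steps in [3]
}

print(configs["imagenet_512"]["lr"])  # 0.0001
```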