ENAT: Rethinking Spatial-temporal Interactions in Token-based Image Synthesis

Authors: Zanlin Ni, Yulin Wang, Renping Zhou, Yizeng Han, Jiayi Guo, Zhiyuan Liu, Yuan Yao, Gao Huang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on ImageNet-256² & 512² and MS-COCO validate the effectiveness of ENAT.
Researcher Affiliation | Collaboration | Zanlin Ni¹, Yulin Wang¹, Renping Zhou¹, Yizeng Han¹, Jiayi Guo¹, Zhiyuan Liu¹, Yuan Yao², Gao Huang¹ (¹Tsinghua University, ²National University of Singapore)
Pseudocode | No | The paper describes its algorithms and processes textually and with diagrams (e.g., Figure 4), but it does not include a formal pseudocode block or algorithm listing. (A generic, illustrative decoding-loop sketch follows the table.)
Open Source Code | Yes | Code and pre-trained models will be released at https://github.com/LeapLabTHU/ENAT.
Open Datasets | Yes | Experiments on ImageNet-256² & 512² and MS-COCO validate the effectiveness of ENAT.
Dataset Splits | Yes | Our evaluation on FID follows the same evaluation protocol as [10, 3, 49]. We adopt the pre-computed dataset statistics from [3] and generate 50k samples for ImageNet (30k for MS-COCO) to compute the statistics for the generated samples... (A hedged sketch of the underlying FID computation follows the table.)
Hardware Specification | Yes | All our experiments are conducted with 8 A100 80G GPUs.
Software Dependencies | No | The paper mentions utilizing a pretrained VQGAN [13] but does not specify software versions or library dependencies used for implementation or experiments.
Experiment Setup | Yes | For ImageNet 256×256, we use a batch size of 2048 and a learning rate of 4e-4. For ImageNet 512×512, to manage the increased sequence length, we reduce the batch size to 512 and linearly scale down the learning rate to 1e-4. For MS-COCO, we train for 150k steps instead of the 1000k steps used in [3]. For our ablation studies in Sec. 5.2 and explorative experiments in Sec. 4, we train the models for 300k steps instead of the 500k steps used in [3], while keeping the other settings the same as above. (A short config sketch illustrating the linear scaling follows the table.)
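
Since the paper (per the Pseudocode row) describes its generation procedure only textually, the sketch below shows the generic, MaskGIT-style confidence-based decoding loop that token-based non-autoregressive models of this kind build on. It is explicitly not ENAT's algorithm; `model`, `mask_id`, the greedy sampling, and the cosine schedule are illustrative assumptions.

```python
import math
import torch

@torch.no_grad()
def iterative_decode(model, seq_len, mask_id, steps=8, device="cuda"):
    """Generic MaskGIT-style decoding loop (illustrative sketch; NOT ENAT's procedure).

    Start fully masked; at each step, predict all tokens, keep the most confident
    newly decoded ones according to a cosine schedule, and re-mask the rest.
    """
    tokens = torch.full((1, seq_len), mask_id, dtype=torch.long, device=device)
    for t in range(steps):
        probs = model(tokens).softmax(dim=-1)   # assumed output: (1, seq_len, vocab) logits
        sampled = probs.argmax(dim=-1)          # greedy for brevity; sampling is also common
        conf = probs.gather(-1, sampled.unsqueeze(-1)).squeeze(-1)
        # Already-decoded tokens get infinite confidence so they are never re-masked.
        conf = torch.where(tokens == mask_id, conf, torch.full_like(conf, float("inf")))
        tokens = torch.where(tokens == mask_id, sampled, tokens)

        # Cosine schedule: fraction of tokens that remain masked after this step.
        num_masked = int(seq_len * math.cos(math.pi / 2 * (t + 1) / steps))
        if num_masked > 0:
            # Re-mask the num_masked least confident positions.
            remask = conf.topk(num_masked, largest=False).indices
            tokens.scatter_(1, remask, mask_id)
    return tokens
```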
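
The FID protocol quoted in the Dataset Splits row compares pre-computed dataset statistics against statistics of the generated samples. A minimal sketch of the underlying Fréchet distance, assuming the Inception-feature means and covariances have already been extracted (the function and variable names are illustrative, not from the ENAT codebase):

```python
import numpy as np
from scipy import linalg

def frechet_distance(mu1, sigma1, mu2, sigma2, eps=1e-6):
    """FID between N(mu1, sigma1) and N(mu2, sigma2):
    ||mu1 - mu2||^2 + Tr(sigma1 + sigma2 - 2 * sqrt(sigma1 @ sigma2)).
    """
    diff = mu1 - mu2
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)
    if not np.isfinite(covmean).all():
        # Numerical fallback: nudge the covariances off singularity.
        offset = np.eye(sigma1.shape[0]) * eps
        covmean, _ = linalg.sqrtm((sigma1 + offset) @ (sigma2 + offset), disp=False)
    covmean = covmean.real  # drop tiny imaginary parts introduced by sqrtm
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2 * np.trace(covmean))

# Usage (names illustrative): mu_ref/sigma_ref from the pre-computed dataset statistics,
# mu_gen/sigma_gen from Inception features of the 50k ImageNet (30k MS-COCO) samples:
# fid = frechet_distance(mu_ref, sigma_ref, mu_gen, sigma_gen)
```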
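
The Experiment Setup row reports a linear learning-rate scaling between the ImageNet 256×256 and 512×512 settings (batch 2048 at 4e-4, batch 512 at 1e-4). A short sketch that reproduces those numbers under a standard linear-scaling rule; the dict layout and the `scale_lr` helper are hypothetical, not taken from the released code:

```python
# Reported ImageNet 256x256 baseline setting.
BASE_BATCH, BASE_LR = 2048, 4e-4

def scale_lr(batch_size, base_batch=BASE_BATCH, base_lr=BASE_LR):
    """Linear LR scaling: the learning rate shrinks proportionally with the batch size."""
    return base_lr * batch_size / base_batch

configs = {
    "imagenet_256": {"batch_size": 2048, "lr": BASE_LR},
    "imagenet_512": {"batch_size": 512, "lr": scale_lr(512)},  # -> 1e-4, as reported
    "mscoco":       {"train_steps": 150_000},                  # vs. 1000k steps in [3]
    "ablations":    {"train_steps": 300_000},                  # vs. 500k steps in [3]
}
assert abs(configs["imagenet_512"]["lr"] - 1e-4) < 1e-12
```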