NUWA-Infinity: Autoregressive over Autoregressive Generation for Infinite Visual Synthesis

Authors: Jian Liang, Chenfei Wu, Xiaowei Hu, Zhe Gan, Jianfeng Wang, Lijuan Wang, Zicheng Liu, Yuejian Fang, Nan Duan

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 4 Experiments, 4.1 Experiment Setup, 4.2 Evaluation on Visual Synthesis, 4.3 Ablation Studies
Researcher Affiliation | Collaboration | 1 Peking University, 2 Microsoft Research Asia, 3 Microsoft Azure AI
Pseudocode | Yes | Algorithm 1: Training Strategy, Algorithm 2: Inference Strategy
Open Source Code | No | 3. If you ran experiments... (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [No]
Open Datasets | Yes | For image synthesis, we trained an unconditional generation model on the LHQ [30]... For video synthesis, we downloaded 120k high-resolution videos from the Pexels website...
Dataset Splits | No | For image synthesis, we trained an unconditional generation model on the LHQ [30], which consists of 90k high-resolution (1024^2) nature landscapes. To additionally support text prompts, we added a caption for each LHQ image to create a new dataset called LHQC, with 85k images as training data and 5k as test data. (No validation split is mentioned.)
Hardware Specification | No | 3. If you ran experiments... (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] See Supplementary material. (The checklist points to the supplementary material; the main paper itself does not state the hardware used.)
Software Dependencies | No | The paper mentions using a 'VQGAN model' and an 'Adam optimizer' but does not specify software dependencies with version numbers (e.g., Python version, PyTorch version, specific library versions).
Experiment Setup | Yes | Implementation Details. During training, images are cropped to 1024×1024 and videos are cut into 1024×1024×5 clips at 5 fps; they are then encoded into discrete tokens using the VQGAN model with a compression rate of 16 and a codebook of 16384. In Sec. 3.2, the rendering size of the three models is 256×256. In Sec. 3.1, based on the nearby sparsity, we set (eh, ew, ef) = (2, 2, 0) for images and (eh, ew, ef) = (1, 1, 3) for videos. We train the model using an Adam optimizer [16] with a learning rate of 1e-4, a batch size of 256, and warm-up over 5% of the total 50 epochs.
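The quoted implementation details imply a fixed token budget per crop. The arithmetic below is an illustrative sketch (not the authors' code; all identifiers are placeholders) of how a 1024×1024 crop maps to a 64×64 token grid under a compression rate of 16, and how a five-frame clip scales that count.

```python
# Hedged sketch: token-count arithmetic implied by the quoted setup.
# Names such as token_grid, image_size, and frames are illustrative only.

def token_grid(image_size=1024, compression_rate=16, frames=1):
    """Return (frames, height, width) of the discrete VQGAN token grid."""
    side = image_size // compression_rate   # 1024 / 16 = 64 tokens per side
    return frames, side, side

# A 1024x1024 image -> 64 x 64 = 4,096 tokens, each an index into a 16,384-entry codebook.
f, h, w = token_grid()
print(f * h * w)   # 4096

# A 1024x1024x5 video clip at 5 fps -> 5 * 64 * 64 = 20,480 tokens.
f, h, w = token_grid(frames=5)
print(f * h * w)   # 20480
```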
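The stated optimizer settings (Adam, learning rate 1e-4, batch size 256, warm-up over 5% of 50 epochs) can likewise be sketched as a standard PyTorch setup. This is an assumption-laden reconstruction, not released code: the paper does not specify the framework, the post-warm-up schedule, or the number of steps per epoch, so `model`, `steps_per_epoch`, and the constant-after-warm-up behavior are placeholders.

```python
# Hedged sketch of the quoted optimization setup; not the authors' training code.
import torch

model = torch.nn.Linear(8, 8)          # stand-in for the NUWA-Infinity model (placeholder)
steps_per_epoch = 1000                 # placeholder; depends on dataset size and batch size 256
epochs = 50
total_steps = epochs * steps_per_epoch
warmup_steps = int(0.05 * total_steps) # warm-up over the first 5% of training

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def lr_lambda(step):
    # Linear warm-up to the base learning rate, then constant.
    # (The paper only states the warm-up fraction, not the decay schedule.)
    return min(1.0, (step + 1) / warmup_steps)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```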