NUWA-Infinity: Autoregressive over Autoregressive Generation for Infinite Visual Synthesis

Authors: Jian Liang, Chenfei Wu, Xiaowei Hu, Zhe Gan, Jianfeng Wang, Lijuan Wang, Zicheng Liu, Yuejian Fang, Nan Duan

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 4 Experiments, 4.1 Experiment Setup, 4.2 Evaluation on Visual Synthesis, 4.3 Ablation Studies
Researcher Affiliation | Collaboration | 1 Peking University, 2 Microsoft Research Asia, 3 Microsoft Azure AI
Pseudocode | Yes | Algorithm 1: Training Strategy, Algorithm 2: Inference Strategy
Open Source Code | No | 3. If you ran experiments... (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [No]
Open Datasets | Yes | For image synthesis, we trained an unconditional generation model on the LHQ [30]... For video synthesis, we downloaded 120k high-resolution videos from the Pexels website...
Dataset Splits | No | For image synthesis, we trained an unconditional generation model on the LHQ [30], which consists of 90k high-resolution (1024^2) nature landscapes. To additionally support text prompts, we added a caption for each LHQ image to create a new dataset called LHQC, with 85k images as training data and 5k as test data. (No validation split is mentioned.)
Hardware Specification | No | 3. If you ran experiments... (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] See Supplementary material. (The checklist points to the supplementary material; the main paper itself does not state the hardware used.)
Software Dependencies | No | The paper mentions using a 'VQGAN model' and an 'Adam optimizer' but does not specify software dependencies with version numbers (e.g., Python version, PyTorch version, specific library versions).
Experiment Setup | Yes | Implementation Details. During training, images are cropped to 1024×1024 and videos are cut into 1024×1024×5 clips at 5 fps; they are then encoded into discrete tokens using the VQGAN model with a compression rate of 16 and a codebook of 16384. In Sec. 3.2, the rendering size of the three models is 256×256. In Sec. 3.1, based on the nearby sparsity, we set (eh, ew, ef) = (2, 2, 0) for images and (eh, ew, ef) = (1, 1, 3) for videos. We train the model using an Adam optimizer [16] with a learning rate of 1e-4, a batch size of 256, and warm-up over 5% of the total 50 epochs.
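The quoted implementation details imply a fixed token budget per crop. The arithmetic below is an illustrative sketch (not the authors' code; all identifiers are placeholders) of how a 1024×1024 crop maps to a 64×64 token grid under a compression rate of 16, and how a five-frame clip scales that count.

```python
# Hedged sketch: token-count arithmetic implied by the quoted setup.
# Names such as token_grid, image_size, and frames are illustrative only.

def token_grid(image_size=1024, compression_rate=16, frames=1):
    """Return (frames, height, width) of the discrete VQGAN token grid."""
    side = image_size // compression_rate   # 1024 / 16 = 64 tokens per side
    return frames, side, side

# A 1024x1024 image -> 64 x 64 = 4,096 tokens, each an index into a 16,384-entry codebook.
f, h, w = token_grid()
print(f * h * w)   # 4096

# A 1024x1024x5 video clip at 5 fps -> 5 * 64 * 64 = 20,480 tokens.
f, h, w = token_grid(frames=5)
print(f * h * w)   # 20480
```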
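The stated optimizer settings (Adam, learning rate 1e-4, batch size 256, warm-up over 5% of 50 epochs) can likewise be sketched as a standard PyTorch setup. This is an assumption-laden reconstruction, not released code: the paper does not specify the framework, the post-warm-up schedule, or the number of steps per epoch, so `model`, `steps_per_epoch`, and the constant-after-warm-up behavior are placeholders.

```python
# Hedged sketch of the quoted optimization setup; not the authors' training code.
import torch

model = torch.nn.Linear(8, 8)          # stand-in for the NUWA-Infinity model (placeholder)
steps_per_epoch = 1000                 # placeholder; depends on dataset size and batch size 256
epochs = 50
total_steps = epochs * steps_per_epoch
warmup_steps = int(0.05 * total_steps) # warm-up over the first 5% of training

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def lr_lambda(step):
    # Linear warm-up to the base learning rate, then constant.
    # (The paper only states the warm-up fraction, not the decay schedule.)
    return min(1.0, (step + 1) / warmup_steps)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```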