NUWA-Infinity: Autoregressive over Autoregressive Generation for Infinite Visual Synthesis
Authors: Jian Liang, Chenfei Wu, Xiaowei Hu, Zhe Gan, Jianfeng Wang, Lijuan Wang, Zicheng Liu, Yuejian Fang, Nan Duan
NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 4 Experiments, 4.1 Experiment Setup, 4.2 Evaluation on Visual Synthesis, 4.3 Ablation Studies |
| Researcher Affiliation | Collaboration | ¹Peking University, ²Microsoft Research Asia, ³Microsoft Azure AI |
| Pseudocode | Yes | Algorithm 1: Training Strategy, Algorithm 2: Inference Strategy |
| Open Source Code | No | 3. If you ran experiments... (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [No] |
| Open Datasets | Yes | For image synthesis, we trained an unconditional generation model on the LHQ [30]... For video synthesis, we downloaded 120k high-resolution videos from the Pexels website... |
| Dataset Splits | No | For image synthesis, we trained an unconditional generation model on the LHQ [30], which consists of 90k high-resolution (1024^2) nature landscapes. In addition, to support text prompts, we added a caption for each image of LHQ to create a new dataset called LHQC, where 85k serve as training data and 5k as test data. (No mention of a validation split) |
| Hardware Specification | No | 3. If you ran experiments... (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] See Supplementary material (The checklist points to the supplementary material, but the hardware details are not stated directly in the provided text). |
| Software Dependencies | No | The paper mentions using a 'VQGAN model' and 'Adam optimizer' but does not specify software dependencies with version numbers (e.g., Python version, PyTorch version, specific library versions). |
| Experiment Setup | Yes | Implementation Details. During training, images are cropped into 1024×1024 and videos are cut into 1024×1024×5 at 5fps; then they are encoded into discrete tokens using the VQGAN model with a compression rate of 16 and a codebook of 16384. In Sec. 3.2, the rendering size of the three models is 256×256. In Sec. 3.1, based on the nearby sparsity, we set (eh, ew, ef) = (2, 2, 0) for images and (eh, ew, ef) = (1, 1, 3) for videos. We train the model using an Adam optimizer [16] with a learning rate of 1e-4, a batch size of 256, and warm-up over 5% of the total 50 epochs. |
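
The Experiment Setup row implies a concrete token budget and optimizer schedule. Below is a minimal sketch of both, assuming a PyTorch implementation; since the authors did not release code, the model stand-in, `steps_per_epoch`, and helper names here are hypothetical, and only the numbers quoted above (crop size, compression rate, codebook size, learning rate, batch size, warm-up fraction, epoch count) come from the paper.

```python
import torch

# Token-grid arithmetic implied by the quoted setup:
# a 1024x1024 crop with VQGAN compression rate 16 yields a 64x64 grid of tokens
# (4096 tokens per image, each drawn from a 16384-entry codebook); a 1024x1024x5
# video clip therefore yields 5 * 4096 = 20480 tokens.
crop_size = 1024
compression_rate = 16
tokens_per_frame = (crop_size // compression_rate) ** 2   # 4096
tokens_per_video = 5 * tokens_per_frame                   # 20480

# Hypothetical stand-in for the autoregressive model (architecture not released).
model = torch.nn.Transformer(d_model=512)

# Adam optimizer with a learning rate of 1e-4, as stated in the paper.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# 50 epochs total with warm-up over the first 5% of training steps; the per-epoch
# step count depends on dataset size and the batch size of 256, so it is assumed.
total_epochs = 50
steps_per_epoch = 1_000                                    # assumption
total_steps = total_epochs * steps_per_epoch
warmup_steps = max(1, int(0.05 * total_steps))

def lr_lambda(step: int) -> float:
    # Linear warm-up, then a constant rate (the paper does not specify a decay schedule).
    return min(1.0, (step + 1) / warmup_steps)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```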