The Image Local Autoregressive Transformer
Authors: Chenjie Cao, Yuxin Hong, Xiang Li, Chengrong Wang, Chengming Xu, Yanwei Fu, Xiangyang Xue
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our iLAT is evaluated on various locally guided image syntheses, such as pose-guided person image synthesis and face editing. Both quantitative and qualitative results show the efficacy of our model. In this section, we present experimental results on pose-guided generation of Penn Action (PA) [52] and Synthetic DeepFashion (SDF) [26], face editing of CelebA [27] and FFHQ [20] compared with other competitors and variants of iLAT. |
| Researcher Affiliation | Academia | Chenjie Cao, Yuxin Hong, Xiang Li, Chengrong Wang, Chengming Xu, Yanwei Fu, Xiangyang Xue, School of Data Science, Fudan University {20110980001,yanweifu}@fudan.edu.cn. Corresponding author. Dr. Fu is also with Fudan ISTBI-ZJNU Algorithm Centre for Brain-inspired Intelligence, Zhejiang Normal University, Jinhua, China. |
| Pseudocode | No | The paper includes architectural diagrams and mathematical formulations but does not contain explicit pseudocode or algorithm blocks. |
| Open Source Code | No | The paper states 'The images used in this paper are all open sourced.' but does not provide an explicit statement or link to the source code for the iLAT methodology itself. |
| Open Datasets | Yes | In this section, we present experimental results on pose-guided generation of Penn Action (PA) [52] and Synthetic DeepFashion (SDF) [26], face editing of CelebA [27] and FFHQ [20] compared with other competitors and variants of iLAT. Datasets. For the pose guiding, the PA dataset [52], which contains 2,326 video sequences of 15 action classes in non-iconic views, is used in this section. The face editing dataset consists of the Flickr-Faces-HQ dataset (FFHQ) [20] and CelebA-HQ [27]. |
| Dataset Splits | No | For the pose guiding, PA dataset [52]... We randomly gather pairs of the same video sequence in the training phase dynamically and select 1,000 testing pairs in the remaining videos. Besides, the SDF is synthesized with DeepFashion [26] images as foregrounds and Places2 [54] images as backgrounds. Since only a few images of DeepFashion have related exact segmentation masks, we select 4,500/285 pairs from it for training and testing respectively. The face editing dataset consists of the Flickr-Faces-HQ dataset (FFHQ) [20] and CelebA-HQ [27]. FFHQ is a high-quality image dataset with 70,000 human faces. We resize them from 1024×1024 into 256×256 and use 68,000 of them for the training. No explicit validation split with percentages or counts is provided. |
| Hardware Specification | No | The paper does not specify the exact hardware (e.g., GPU models, CPU types, or memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions 'Our method is implemented in PyTorch' and the use of 'Adam optimizer [21]' but does not provide specific version numbers for PyTorch or any other software dependencies. |
| Experiment Setup | Yes | For the TS-VQGAN training, we use the Adam optimizer [21] with β1 = 0.5 and β2 = 0.9. TS-VQGAN is trained with 150k steps without masks at first, and then it is trained with another 150k steps with masks in batch size 16. The initial learning rates of pose guiding and face editing are 8e-5 and 2e-4 respectively, which are decayed by 0.5 for every 50k steps. For the transformer training, we use Adam with β1 = 0.9 and β2 = 0.95 with initial learning rate 5e-5 and 0.01 weight decay. Besides, we warmup the learning rate with the first 10k steps, then it is linearly decayed to 0 for 300k iterations with batch size 16. (A minimal PyTorch sketch of this optimizer and schedule setup is given below the table.) |
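
The following is a minimal PyTorch sketch of the optimizer and learning-rate schedules reported in the Experiment Setup row. It is an assumption-laden reconstruction, not the authors' released code: the `vqgan` and `transformer` modules are placeholder stand-ins for the actual TS-VQGAN and iLAT transformer, and the schedule shapes (step decay for TS-VQGAN, linear warmup plus linear decay for the transformer) follow the reported hyperparameters.

```python
# Sketch of the reported iLAT training settings (assumed reconstruction).
import torch

# TS-VQGAN: Adam(β1=0.5, β2=0.9), lr 8e-5 (pose guiding) or 2e-4 (face editing),
# halved every 50k steps.
vqgan = torch.nn.Linear(256, 256)  # placeholder for the TS-VQGAN module
vqgan_opt = torch.optim.Adam(vqgan.parameters(), lr=8e-5, betas=(0.5, 0.9))
vqgan_sched = torch.optim.lr_scheduler.StepLR(vqgan_opt, step_size=50_000, gamma=0.5)

# Transformer: Adam(β1=0.9, β2=0.95), lr 5e-5, weight decay 0.01,
# 10k-step warmup, then linear decay to 0 over 300k iterations.
transformer = torch.nn.Linear(256, 256)  # placeholder for the iLAT transformer
tr_opt = torch.optim.Adam(
    transformer.parameters(), lr=5e-5, betas=(0.9, 0.95), weight_decay=0.01
)

WARMUP_STEPS, TOTAL_STEPS = 10_000, 300_000

def lr_lambda(step: int) -> float:
    """Linear warmup for the first 10k steps, then linear decay to 0 at 300k."""
    if step < WARMUP_STEPS:
        return step / WARMUP_STEPS
    return max(0.0, (TOTAL_STEPS - step) / (TOTAL_STEPS - WARMUP_STEPS))

tr_sched = torch.optim.lr_scheduler.LambdaLR(tr_opt, lr_lambda)

for step in range(TOTAL_STEPS):
    # ... forward/backward pass on a batch of 16 would go here ...
    tr_opt.step()
    tr_sched.step()
    tr_opt.zero_grad()
```

Whether the authors used a plain `Adam` with `weight_decay` or a decoupled variant such as `AdamW` is not stated in the paper; the sketch follows the wording "Adam ... 0.01 weight decay" literally.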