The Image Local Autoregressive Transformer

Authors: Chenjie Cao, Yuxin Hong, Xiang Li, Chengrong Wang, Chengming Xu, Yanwei Fu, Xiangyang Xue

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our iLAT is evaluated on various locally guided image syntheses, such as pose-guided person image synthesis and face editing. Both quantitative and qualitative results show the efficacy of our model. In this section, we present experimental results on pose-guided generation of Penn Action (PA) [52] and Synthetic DeepFashion (SDF) [26], and face editing of CelebA [27] and FFHQ [20], compared with other competitors and variants of iLAT.
Researcher Affiliation | Academia | Chenjie Cao, Yuxin Hong, Xiang Li, Chengrong Wang, Chengming Xu, Yanwei Fu, Xiangyang Xue; School of Data Science, Fudan University; {20110980001,yanweifu}@fudan.edu.cn. Corresponding author: Dr. Fu, who is also with the Fudan ISTBI-ZJNU Algorithm Centre for Brain-inspired Intelligence, Zhejiang Normal University, Jinhua, China.
Pseudocode | No | The paper includes architectural diagrams and mathematical formulations but does not contain explicit pseudocode or algorithm blocks.
Open Source Code | No | The paper states 'The images used in this paper are all open sourced.' but does not provide an explicit statement or link to the source code for the iLAT method itself.
Open Datasets | Yes | In this section, we present experimental results on pose-guided generation of Penn Action (PA) [52] and Synthetic DeepFashion (SDF) [26], and face editing of CelebA [27] and FFHQ [20], compared with other competitors and variants of iLAT. Datasets: for pose guiding, the PA dataset [52], which contains 2,326 video sequences of 15 action classes in non-iconic views, is used. The face editing data consist of the Flickr-Faces-HQ dataset (FFHQ) [20] and CelebA-HQ [27].
Dataset Splits | No | For pose guiding with the PA dataset [52], training pairs are gathered randomly and dynamically from the same video sequence, and 1,000 testing pairs are selected from the remaining videos. The SDF is synthesized with DeepFashion [26] images as foregrounds and Places2 [54] images as backgrounds; since only a few DeepFashion images have exact segmentation masks, 4,500/285 pairs are selected for training and testing respectively. The face editing data consist of the Flickr-Faces-HQ dataset (FFHQ) [20] and CelebA-HQ [27]. FFHQ is a high-quality image dataset with 70,000 human faces, resized from 1024×1024 to 256×256, of which 68,000 are used for training. No explicit validation split with percentages or counts is provided. (A hedged sketch of the FFHQ preprocessing and split appears after the table.)
Hardware Specification | No | The paper does not specify the hardware (e.g., GPU models, CPU types, or memory) used to run the experiments.
Software Dependencies | No | The paper mentions 'Our method is implemented in PyTorch' and the use of the 'Adam optimizer [21]', but does not provide version numbers for PyTorch or any other software dependency.
Experiment Setup | Yes | For the TS-VQGAN training, we use the Adam optimizer [21] with β1 = 0.5 and β2 = 0.9. TS-VQGAN is trained for 150k steps without masks at first, and then for another 150k steps with masks, with batch size 16. The initial learning rates for pose guiding and face editing are 8e-5 and 2e-4 respectively, decayed by 0.5 every 50k steps. For the transformer training, we use Adam with β1 = 0.9 and β2 = 0.95, initial learning rate 5e-5, and 0.01 weight decay. The learning rate is warmed up over the first 10k steps, then linearly decayed to 0 over 300k iterations with batch size 16. (A hedged PyTorch sketch of these optimizer settings appears after the table.)
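
The Experiment Setup row translates directly into optimizer and scheduler configuration. Below is a minimal PyTorch sketch of those hyperparameters. Only the numeric values come from the paper; the placeholder models, the scheduler choices, and the assumption that the transformer's 0.01 weight decay is plain (non-decoupled) Adam weight decay are ours.

```python
# Minimal sketch of the reported training hyperparameters (values from the paper;
# placeholder models and the scheduling code are assumptions).
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import StepLR, LambdaLR

ts_vqgan = torch.nn.Linear(8, 8)      # placeholder for the TS-VQGAN model
transformer = torch.nn.Linear(8, 8)   # placeholder for the transformer

# TS-VQGAN: Adam with (beta1, beta2) = (0.5, 0.9); initial LR 8e-5 for pose
# guiding (2e-4 for face editing), halved every 50k steps; 150k steps without
# masks, then another 150k steps with masks, batch size 16.
vqgan_opt = Adam(ts_vqgan.parameters(), lr=8e-5, betas=(0.5, 0.9))
vqgan_sched = StepLR(vqgan_opt, step_size=50_000, gamma=0.5)

# Transformer: Adam with (beta1, beta2) = (0.9, 0.95), initial LR 5e-5, weight
# decay 0.01; linear warmup over the first 10k steps, then linear decay to 0
# by 300k iterations, batch size 16.
def warmup_then_linear_decay(step, warmup=10_000, total=300_000):
    if step < warmup:
        return step / warmup
    return max(0.0, (total - step) / (total - warmup))

tf_opt = Adam(transformer.parameters(), lr=5e-5, betas=(0.9, 0.95), weight_decay=0.01)
tf_sched = LambdaLR(tf_opt, lr_lambda=warmup_then_linear_decay)

# After each training step: optimizer.step() followed by scheduler.step().
```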
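
For the Dataset Splits row, the only concrete FFHQ detail the paper gives is resizing the 1024×1024 images to 256×256 and using 68,000 of the 70,000 faces for training. The sketch below is one plausible reading of that preprocessing; the directory paths, file pattern, and the choice of which 68,000 images form the training set are hypothetical.

```python
# Hypothetical FFHQ preprocessing and split following the description above:
# resize 1024x1024 -> 256x256 and reserve 68,000 of the 70,000 images for training.
from pathlib import Path
from PIL import Image

src_dir = Path("data/ffhq1024")            # assumed location of the originals
dst_dir = Path("data/ffhq256")
dst_dir.mkdir(parents=True, exist_ok=True)

files = sorted(src_dir.glob("*.png"))
for f in files:
    Image.open(f).resize((256, 256), Image.LANCZOS).save(dst_dir / f.name)

resized = sorted(dst_dir.glob("*.png"))
train_files, test_files = resized[:68_000], resized[68_000:]   # 68k train / 2k held out
```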