FuseAnyPart: Diffusion-Driven Facial Parts Swapping via Multiple Reference Images
Authors: Zheng Yu, Yaohua Wang, Siying Cui, Aixi Zhang, Wei-Long Zheng, Senzhang Wang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments qualitatively and quantitatively validate the superiority and robustness of FuseAnyPart. Source codes are available at https://github.com/Thomas-wyh/FuseAnyPart. Dataset: We train our model on the CelebA-HQ [11] dataset. The CelebA-HQ dataset contains 30,000 high-resolution face images of celebrities widely used for face generation and face swapping tasks. This dataset has been pre-processed and aligned, and is available in three different resolutions. In our experiments, we use the 1024×1024 resolution. Our evaluation set is sampled from the FaceForensics++ [25] dataset, which contains 1,000 videos. We randomly sample 10 frames from each video and obtain 10,000 images. |
| Researcher Affiliation | Collaboration | Zheng Yu (Shanghai Jiao Tong University & Alibaba Group, cs-yuzheng@sjtu.edu.cn); Yaohua Wang* (Alibaba Group, xiachen.wyh@alibaba-inc.com); Siying Cui (Peking University & Alibaba Group, cuisiying.csy@alibaba-inc.com); Aixi Zhang (Alibaba Group, aixi.zhax@alibaba-inc.com); Wei-Long Zheng (Shanghai Jiao Tong University, weilong@sjtu.edu.cn); Senzhang Wang (Central South University, szwang@csu.edu.cn) |
| Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Source codes are available at https://github.com/Thomas-wyh/FuseAnyPart. |
| Open Datasets | Yes | We train our model on the CelebA-HQ [11] dataset. The CelebA-HQ dataset contains 30,000 high-resolution face images of celebrities widely used for face generation and face swapping tasks. This dataset has been pre-processed and aligned, and is available in three different resolutions. In our experiments, we use the 1024×1024 resolution. Our evaluation set is sampled from the FaceForensics++ [25] dataset |
| Dataset Splits | No | The paper specifies the training and evaluation (test) datasets but does not describe a separate validation split or how it was used for hyperparameter tuning. The 'evaluation set' it mentions is used for quantitative comparisons, i.e., it serves as a test set. |
| Hardware Specification | Yes | We train our model on 16 NVIDIA A100 GPUs (80GB) with a batch size of 16 per GPU using the AdamW optimizer [16] with a constant learning rate of 1e-4 and weight decay of 0.01. |
| Software Dependencies | Yes | Our implementation is based on the Hugging Face diffusers [30] library, and we use Stable Diffusion v1-5 [24] and OpenAI's clip-vit-large-patch14 vision model [22]. |
| Experiment Setup | Yes | We train our model on 16 NVIDIA A100 GPUs (80GB) with a batch size of 16 per GPU using the AdamW optimizer [16] with a constant learning rate of 1e-4 and weight decay of 0.01. During training, facial part reference images are randomly sampled from images with the same ID, and the target image is consistent with the face reference image. During the inference stage, we use the DDIM [28] sampler with 50 steps and set λ = 1.0. Since we do not use a text prompt, we set the text prompt to empty. |
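The quoted setup can be sketched as a minimal PyTorch configuration. This is illustrative only: the `Linear` module is a stand-in for the paper's trainable components, and the commented diffusers snippet assumes the standard Stable Diffusion v1-5 pipeline rather than the authors' released code.

```python
import torch

# Optimizer configuration as reported: AdamW, constant lr 1e-4, weight decay 0.01.
def make_optimizer(params):
    return torch.optim.AdamW(params, lr=1e-4, weight_decay=0.01)

model = torch.nn.Linear(8, 8)  # placeholder for the trainable modules
optimizer = make_optimizer(model.parameters())

# The reported inference setup (DDIM sampler, 50 steps, empty text prompt) would
# look roughly like this with diffusers (commented out to avoid a model download):
#
# from diffusers import StableDiffusionPipeline, DDIMScheduler
# pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
# pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
# image = pipe(prompt="", num_inference_steps=50).images[0]
```

Note that the paper's per-GPU batch size of 16 across 16 A100s implies an effective batch size of 256; any data-parallel wrapper (e.g., DDP) is omitted here.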