FashionR2R: Texture-preserving Rendered-to-Real Image Translation with Diffusion Models

Authors: Rui Hu, Qian He, Gaofeng He, Jiedong Zhuang, Huang Chen, Huafeng Liu, Huamin Wang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experimental results demonstrate the superiority and effectiveness of our method in rendered-to-real image translation.
Researcher Affiliation | Collaboration | Zhejiang University, Style3D Research
Pseudocode | Yes | Algorithm 1: Realistic Image Generation
Open Source Code | Yes | Data and code are released in the supplemental materials to preserve anonymity and will be made publicly available after the review process.
Open Datasets | Yes | Additionally, we introduce the SynFashion dataset, featuring high-quality digital clothing images with diverse textures. We evaluate our method on the public rendered Face Synthetics dataset [1] and our collected SynFashion dataset.
Dataset Splits | No | The paper mentions finetuning with 2500 realistic images and using 7500 test result images, but does not explicitly state a validation split or its size.
Hardware Specification | Yes | The finetuning... is conducted on 2 RTX 4090 GPUs with a batch size of 6. Based on the finetuned model, we train our negative domain embedding with 2500 rendered images on a single RTX 4090 with a batch size of 1. We test the inference time and resource consumption for a 512x512 image on an RTX 3090.
Software Dependencies | No | The paper states using the 'pretrained Stable Diffusion (SD) model' and specifies 'SD v1.5', but does not list other software dependencies with specific version numbers (e.g., Python, PyTorch, or TensorFlow versions).
Experiment Setup | Yes | The finetuning uses images from the iMaterialist (Fashion) 2019 FGVC dataset [73], is based on the publicly available SD v1.5, and is conducted on 2 RTX 4090 GPUs with a batch size of 6. The rendered images are resized to a resolution of 512 × 512. The placeholder embedding size is 75 and the learning rate is 5e-4. During sampling, we perform DDIM sampling with the default 50 denoising steps and a denoising strength of 0.3. γ is set to 0.9 by default, meaning that TAC is applied on the first 90% of the sampling steps; only the attention maps in the first and second shallow layers are used for TAC. (See the sampling sketch below the table.)
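
To make the reported sampling configuration concrete, the following is a minimal sketch using the Hugging Face diffusers img2img pipeline, not the authors' released code. The model id, prompt, negative prompt, file names, and guidance scale are assumptions; the paper's learned negative domain embedding and Texture-aware Attention Control (TAC) are custom components and are only noted in comments, not implemented.

```python
# Minimal sketch (not the authors' released code): the reported sampling setup
# (SD v1.5, DDIM, 50 steps, denoising strength 0.3, 512x512 input) expressed
# with Hugging Face diffusers. Model id, prompt, file names, and guidance
# scale are assumptions.
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline, DDIMScheduler

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",   # assumed SD v1.5 checkpoint
    torch_dtype=torch.float16,
).to("cuda")
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)  # DDIM sampling

# Rendered input resized to 512x512, as in the reported setup (placeholder file name).
rendered = Image.open("rendered_garment.png").convert("RGB").resize((512, 512))

# strength=0.3 means only the last ~30% of the 50-step schedule is actually
# denoised, which is what keeps most of the rendered texture intact. The
# paper's learned negative domain embedding would be passed through
# `negative_prompt_embeds`; a plain negative prompt stands in for it here.
image = pipe(
    prompt="a photo of a person wearing clothing",   # placeholder prompt
    negative_prompt="rendered, CGI",                 # stand-in for the learned embedding
    image=rendered,
    strength=0.3,
    num_inference_steps=50,
    guidance_scale=7.5,                              # not reported in the paper
).images[0]
image.save("translated.png")
```

TAC itself would additionally require hooking the U-Net's shallow attention layers during the first 90% of these denoising steps (γ = 0.9), which is beyond the scope of this sketch.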