FashionR2R: Texture-preserving Rendered-to-Real Image Translation with Diffusion Models
Authors: Rui Hu, Qian He, Gaofeng He, Jiedong Zhuang, Huang Chen, Huafeng Liu, Huamin Wang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experimental results demonstrate the superiority and effectiveness of our method in rendered-to-real image translation. |
| Researcher Affiliation | Collaboration | ¹Zhejiang University, ²Style3D Research |
| Pseudocode | Yes | Algorithm 1 Realistic Image Generation |
| Open Source Code | Yes | Data and code are released in the supplemental materials to preserve anonymity and will be made publicly available after the review process. |
| Open Datasets | Yes | Additionally, we introduce the SynFashion dataset, featuring high-quality digital clothing images with diverse textures. Extensive experimental results demonstrate the superiority and effectiveness of our method in rendered-to-real image translation. We evaluate our method on the public rendered Face Synthetics dataset [1] and our collected SynFashion dataset. |
| Dataset Splits | No | The paper mentions finetuning with 2,500 realistic images and using 7,500 result images for testing, but does not explicitly state a validation split or its size. |
| Hardware Specification | Yes | The finetuning... is conducted on 2 RTX 4090 with a batch size of 6. Based on the finetuned model, we train our negative domain embedding with 2500 rendered images on a single RTX 4090 with a batch size of 1. We test the inference time and resource consumption for a 512×512 image on an RTX 3090. |
| Software Dependencies | No | The paper states using the 'pretrained Stable Diffusion (SD) model' and specifies 'SD v1.5', but does not list other software dependencies with specific version numbers (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | The finetuning uses images from the iMaterialist (Fashion) 2019 FGVC dataset [73], is based on the publicly available SD v1.5, and is conducted on 2 RTX 4090 with a batch size of 6. The rendered images are resized to a resolution of 512 × 512. The placeholder embedding size is 75 and the learning rate is 5e-4. During sampling, we perform DDIM sampling with the default 50 denoising steps and a denoising strength of 0.3. γ is set to 0.9 by default, meaning that TAC is performed on the first 90% of sampling steps. Only the attention maps in the first and second shallow layers are used for TAC. (An illustrative sketch of this sampling setup follows the table.) |
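
For context, the quoted sampling setup maps onto a standard image-to-image diffusion call. The sketch below is a minimal illustration, not the authors' released code: it assumes the `diffusers` library's `StableDiffusionImg2ImgPipeline` with a `DDIMScheduler`, stands in for the paper's learned negative domain embedding with a hypothetical tensor passed as `negative_prompt_embeds`, and omits the texture-preserving attention control (TAC), which would require custom attention processors on the shallow UNet layers. File names and the prompt are placeholders.

```python
# Minimal sketch, NOT the authors' released code: DDIM img2img with SD v1.5
# using the defaults quoted above (50 denoising steps, denoising strength 0.3).
import torch
from diffusers import StableDiffusionImg2ImgPipeline, DDIMScheduler
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)

# Rendered input, resized to 512x512 as in the quoted setup.
rendered = Image.open("rendered.png").convert("RGB").resize((512, 512))

# Hypothetical stand-in for the learned negative domain embedding:
# 75 learned placeholder tokens plus BOS/EOS give the 77-token sequence
# SD v1.5 expects, i.e. a tensor of shape (1, 77, 768).
neg_embeds = torch.load("negative_domain_embedding.pt").to("cuda", torch.float16)

# TAC is omitted here; per the quote it would act on the first 90% of
# sampling steps (gamma = 0.9), using only the attention maps of the
# first and second shallow layers.
image = pipe(
    prompt="a photo",                   # placeholder; the paper's prompt is not quoted
    image=rendered,
    strength=0.3,                       # denoising strength 0.3 (default)
    num_inference_steps=50,             # DDIM, 50 steps
    negative_prompt_embeds=neg_embeds,  # learned negative domain embedding
).images[0]
image.save("realistic.png")
```

This reconstructs only the generic img2img scaffolding around the quoted hyperparameters; the negative domain embedding training and TAC are the paper's contributions and are not reproduced here.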