Collaborative Score Distillation for Consistent Visual Editing

Authors: Subin Kim, Kyungmin Lee, June Suk Choi, Jongheon Jeong, Kihyuk Sohn, Jinwoo Shin

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show the effectiveness of CSD in a variety of editing tasks, encompassing the visual editing of panorama images, videos, and 3D scenes. Our results underline the competency of CSD as a versatile method for enhancing inter-sample consistency, thereby broadening the applicability of text-to-image diffusion models.
Researcher Affiliation | Collaboration | Subin Kim (1), Kyungmin Lee (1), June Suk Choi (1), Jongheon Jeong (1), Kihyuk Sohn (2), Jinwoo Shin (1); 1: KAIST, 2: Google Research
Pseudocode | No | The paper describes its methods using mathematical equations and descriptive text, but it does not include any clearly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | No | Visualizations are available at the website https://subin-kim-cv.github.io/CSD. The website linked in the paper states 'Our code will be publicly available at github.com/subin-kim-cv/CSD', which indicates a future release, not current availability.
Open Datasets | Yes | For the video editing experiments, we use video sequences from the popular DAVIS [33] dataset at a resolution of 1920 × 1080.
Dataset Splits | No | Following Instruct-NeRF2NeRF [39], we first pretrain NeRF using the nerfacto model from NeRFStudio [57], training it for 30,000 steps. Next, we re-initialize the optimizer and finetune the pre-trained NeRF model with edited train views. In contrast to Instruct-NeRF2NeRF, which edits one train view with Instruct-Pix2Pix after every 10 steps of update, we edit a batch of train views (batch size of 16) with CSD-Edit after every 2000 steps of update. The batch is randomly selected among the train views without replacement. The paper mentions 'train views' but does not explicitly describe a validation set or a formal train/validation/test split of the data used for finetuning NeRFs. (A schematic of this dataset-update loop is sketched after the table.)
Hardware Specification | Yes | All experiments are conducted on AMD EPYC 7V13 64-Core Processor and a single NVIDIA A100 80GB.
Software Dependencies | No | For the experiments with CSD-Edit, we use the publicly available pre-trained model of Instruct-Pix2Pix [14] by default. We perform CSD-Edit optimization on the output space of the Stable Diffusion [4] autoencoder. Throughout the experiments, we use the OpenCLIP [56] ViT-bigG-14 model for evaluation. Following Instruct-NeRF2NeRF [39], we first pretrain NeRF using the nerfacto model from NeRFStudio [57]. We use the Adan [62] optimizer. The paper names these software components and models but does not pin version numbers for any of them. (A loading sketch for these components follows the table.)
Experiment Setup | Yes | We set tmin = 0.2 and tmax = 0.5, whereas the original SDS optimization for DreamFusion used tmin = 0.2 and tmax = 0.98; this is because we do not generally require a large scale of noise in editing. We use the guidance scale ωy ∈ [3.0, 15.0] and image guidance scale ωs ∈ [1.5, 5.0]. We use learning rates in [0.25, 2] and optimize for [200, 500] iterations. We use the Adan [62] optimizer with learning rate warmup over 2000 steps from 10^-9 to 2 × 10^-3, followed by cosine decay down to 10^-6. We use a batch size of 4 and optimize for 10000 steps in total. (The learning-rate schedule and quoted ranges are sketched after the table.)
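The dataset-update procedure quoted under Dataset Splits can be summarized as a short loop. The sketch below is a minimal illustration under stated assumptions: the callables nerf_update_step and csd_edit_batch are hypothetical placeholders for a NeRF optimization step and a CSD-Edit pass, and the 10,000-step budget is taken from the quoted setup. This is not the authors' released code.

```python
import random

def finetune_nerf_with_csd_edit(
    train_views,         # list of train views (e.g., (image, camera) pairs); assumed >= 16 views
    nerf_update_step,    # hypothetical callable: one NeRF finetuning step on the current views
    csd_edit_batch,      # hypothetical callable: CSD-Edit applied jointly to a batch of views
    total_steps=10_000,  # total finetuning budget (taken from the quoted setup)
    edit_every=2_000,    # edit a fresh batch of train views every 2,000 NeRF updates
    edit_batch_size=16,  # CSD-Edit batch size reported for 3D scene editing
):
    """Sketch of the Instruct-NeRF2NeRF-style dataset-update loop with CSD-Edit
    replacing per-view Instruct-Pix2Pix edits. Assumes the NeRF (nerfacto) has
    already been pretrained for 30,000 steps before this loop runs."""
    unedited = list(range(len(train_views)))  # pool for sampling without replacement
    for step in range(total_steps):
        if step % edit_every == 0:
            if len(unedited) < edit_batch_size:
                # Every view has been visited; refill the pool for the next pass.
                unedited = list(range(len(train_views)))
            batch = random.sample(unedited, edit_batch_size)
            unedited = [i for i in unedited if i not in batch]
            # Edit the sampled views jointly so the edits stay consistent across views.
            for i, view in zip(batch, csd_edit_batch([train_views[i] for i in batch])):
                train_views[i] = view
        nerf_update_step(train_views)
```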
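The components listed under Software Dependencies can be assembled from public checkpoints. A minimal loading sketch follows; the Hugging Face identifier timbrooks/instruct-pix2pix and the laion2b_s39b_b160k OpenCLIP checkpoint tag are our assumptions, since the paper does not pin model revisions or library versions.

```python
import torch
import open_clip
from diffusers import StableDiffusionInstructPix2PixPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"

# Publicly available Instruct-Pix2Pix weights (the checkpoint name is an assumption).
pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to(device)

# The Stable Diffusion autoencoder; CSD-Edit is reported to optimize in its output space.
vae = pipe.vae

# OpenCLIP ViT-bigG-14, used in the paper for evaluation only
# (the pretrained tag below is an assumption).
clip_model, _, clip_preprocess = open_clip.create_model_and_transforms(
    "ViT-bigG-14", pretrained="laion2b_s39b_b160k"
)
clip_tokenizer = open_clip.get_tokenizer("ViT-bigG-14")
```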
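The Experiment Setup row quotes endpoint values but not the exact schedule shapes. The sketch below collects the quoted ranges and implements one plausible warmup-plus-cosine schedule for the Adan-optimized finetuning; the linear warmup and cosine decay forms are assumptions, as the paper only names the endpoints.

```python
import math

# Quoted CSD-Edit ranges (exact per-experiment values vary within these intervals).
CSD_EDIT_SETUP = {
    "t_min": 0.2,                        # vs. t_min = 0.2, t_max = 0.98 in DreamFusion's SDS
    "t_max": 0.5,
    "guidance_scale": (3.0, 15.0),       # ω_y
    "image_guidance_scale": (1.5, 5.0),  # ω_s
    "latent_lr": (0.25, 2.0),            # learning rate range for direct latent optimization
    "iterations": (200, 500),            # iteration range for direct latent optimization
}

def lr_at_step(step, warmup_steps=2_000, total_steps=10_000,
               lr_start=1e-9, lr_peak=2e-3, lr_end=1e-6):
    """One plausible schedule matching the quoted endpoints: linear warmup over
    2,000 steps from 1e-9 to 2e-3, then cosine decay down to 1e-6 by step 10,000.
    Intended to drive the Adan optimizer named in the paper."""
    if step < warmup_steps:
        return lr_start + (lr_peak - lr_start) * step / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return lr_end + 0.5 * (lr_peak - lr_end) * (1.0 + math.cos(math.pi * progress))
```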