ReplaceAnything3D: Text-Guided Object Replacement in 3D Scenes with Compositional Scene Representations
Authors: Edward Bartrum, Thu H. Nguyen-Phuoc, Christopher Xie, Zhengqin Li, Numair Khan, Armen Avetisyan, Douglas Lanman, Lei Xiao
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the versatility of RAM3D by applying it to various realistic 3D scene types, showcasing results of modified objects that blend in seamlessly with the scene without impacting its overall integrity. We conduct experiments on real 3D scenes varying in complexity: forward-facing scenes, 360 scenes and human avatar. |
| Researcher Affiliation | Collaboration | Edward Bartrum (University College London, London, England; edward.bartrum.18@ucl.ac.uk); Thu Nguyen-Phuoc (Meta Reality Labs, London, England; thunp@meta.com); Chris Xie (Meta Reality Labs, Redmond, Washington; chrisdxie@meta.com); Zhengqin Li (Meta Reality Labs, Redmond, Washington; zhl@meta.com); Numair Khan (Meta Reality Labs, Redmond, Washington; numairkhan@meta.com); Armen Avetisyan (Meta Reality Labs, London, England; aavetisyan@meta.com); Douglas Lanman (Meta Reality Labs, Redmond, Washington; douglas.lanman@meta.com); Lei Xiao (Meta Reality Labs, Redmond, Washington; lei.xiao@meta.com) |
| Pseudocode | No | No pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | No | Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: We base our method on open-source pre-trained models without proposing new datasets. Comprehensive implementation and necessary infrastructure details for our method are provided in the supplementary material. |
| Open Datasets | Yes | For forward-facing scenes, we show results for the STATUE and RED-NET scenes from the SPIn-NeRF dataset [27], as well as the FERN scene from NeRF [39]. For the 360 scene, we show results from the GARDEN scene from Mip-NeRF 360 [44]. For the avatar result, we use the FACE dataset from Instruct-NeRF2NeRF [3]. |
| Dataset Splits | No | No explicit training/test/validation dataset splits (e.g., percentages or counts) are provided for RAM3D's training stages. The paper mentions using existing datasets and states 'Each dataset is downsampled to have a shortest image side-length (height) equal to 512', and later, 'we create a new training set using the edited images and camera poses from the original scene, and reconstruct the modified 3D scene using any choice of 3D representations for novel view synthesis', but does not detail the splits used for these processes. |
| Hardware Specification | Yes | Training takes approximately 12 hours on a single 32GB V100 GPU. |
| Software Dependencies | No | No specific software versions are provided. The paper mentions software and libraries such as 'nerf-pytorch [59]', the 'VGG-16 network [60]', the 'Adam optimiser [61]', 'nerf-studio [62]', and 'Gaussian Splatting [36]', but does not specify their version numbers. |
| Experiment Setup | Yes | We use the Adam optimiser [61] with a learning rate of 1e-3, which is scaled up by 10 for the Instant-NGP hash encoding parameters. ... Each dataset is downsampled to have a shortest image side-length (height) equal to 512... We use a CFG scale of 30 during the Replace stage, and 7.5 during the Erase stage. We also adopt the HiFA noise-level schedule, with t_min = 0.2, t_max = 0.98, and use stochasticity hyperparameter η = 0. ... We render the RAM3D radiance function using a coarse-to-fine sampling strategy, with 128 coarse and 128 fine ray samples. ... We train RAM3D for 20,000 training steps, during both Erase and Replace training stages. |
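
The Experiment Setup row above collects the key optimisation hyperparameters. The sketch below shows, in plain PyTorch, how those quoted values could be wired together; it is not the authors' released code. All class and function names (`RAM3DField`, `noise_level`, `cfg_scale`) are hypothetical, the embedding module is only a stand-in for the Instant-NGP hash encoding, and the square-root annealing in `noise_level` is one plausible reading of the HiFA schedule, since the exact formula is not restated in the table.

```python
# Hypothetical reconstruction of the training configuration quoted above.
# Module and function names are placeholders, not the authors' released code.
import math
import torch

# Hyperparameters quoted in the Experiment Setup row.
BASE_LR = 1e-3             # Adam learning rate
HASH_LR_SCALE = 10.0       # x10 for Instant-NGP hash-encoding parameters
CFG_REPLACE, CFG_ERASE = 30.0, 7.5
T_MIN, T_MAX = 0.2, 0.98   # HiFA noise-level bounds
ETA = 0.0                  # stochasticity hyperparameter
N_COARSE, N_FINE = 128, 128  # coarse-to-fine ray samples
TOTAL_STEPS = 20_000       # per stage (Erase and Replace)


class RAM3DField(torch.nn.Module):
    """Placeholder radiance field: a hash-grid-style encoding plus a small MLP head."""

    def __init__(self) -> None:
        super().__init__()
        # Stand-in for the Instant-NGP multiresolution hash encoding.
        self.hash_encoding = torch.nn.Embedding(2 ** 14, 32)
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(32, 64),
            torch.nn.ReLU(),
            torch.nn.Linear(64, 4),  # RGB + density
        )


model = RAM3DField()

# Two parameter groups so only the hash-encoding weights get the 10x learning rate.
hash_params = list(model.hash_encoding.parameters())
other_params = [p for name, p in model.named_parameters()
                if not name.startswith("hash_encoding")]
optimizer = torch.optim.Adam([
    {"params": other_params, "lr": BASE_LR},
    {"params": hash_params, "lr": BASE_LR * HASH_LR_SCALE},
])


def noise_level(step: int, total: int = TOTAL_STEPS) -> float:
    """Annealed diffusion noise level in [T_MIN, T_MAX].

    HiFA anneals the noise level from t_max toward t_min over training; the
    square-root shape used here is illustrative, not taken from the paper.
    """
    return T_MAX - (T_MAX - T_MIN) * math.sqrt(step / total)


def cfg_scale(stage: str) -> float:
    """Classifier-free-guidance scale per training stage (Replace vs. Erase)."""
    return CFG_REPLACE if stage == "replace" else CFG_ERASE
```

Splitting the optimiser into parameter groups is the standard PyTorch way to give only the hash-encoding weights the 10x learning-rate scale; the coarse/fine sample counts and η would be consumed by the volume renderer and the diffusion guidance step, both omitted from this sketch.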