Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
SceneScape: Text-Driven Consistent Scene Generation
Authors: Rafail Fridman, Amit Abecasis, Yoni Kasten, Tali Dekel
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We thoroughly evaluate and ablate our method, demonstrating a significant improvement in quality and 3D consistency over existing methods. |
| Researcher Affiliation | Collaboration | Rafail Fridman (Weizmann Institute of Science), Amit Abecasis (Weizmann Institute of Science), Yoni Kasten (NVIDIA Research), Tali Dekel (Weizmann Institute of Science) |
| Pseudocode | No | The paper describes its method through text and diagrams but does not include formal pseudocode or algorithm blocks. |
| Open Source Code | No | The paper provides a project page link (https://scenescape.github.io/) but not a direct link to a source-code repository for the methodology. |
| Open Datasets | Yes | To compare to GEN-1 we used the Real Estate10K dataset [66], consisting of curated Internet videos and corresponding camera poses. |
| Dataset Splits | No | The paper describes training and testing procedures but does not explicitly provide details about a dedicated validation dataset split, its size, or its percentage. |
| Hardware Specification | Yes | Synthesizing 50 frame-long videos with our full method takes approximately 2.5 hours on an NVIDIA Tesla V100 GPU. |
| Software Dependencies | No | The paper mentions several models and tools (Stable Diffusion, DDIM scheduler, MiDaS DPT-Large, PyTorch3D) but does not provide specific version numbers for these software dependencies or other libraries. |
| Experiment Setup | Yes | For each generated frame, we finetune it for 300 epochs, using Adam optimizer [25] with a learning rate of 1e-7. Additionally, we revert the weights of the depth prediction model to the initial state, as discussed in Sec. 3.3. We finetune the LDM decoder for 100 epochs on each generation step using Adam optimizer with a learning rate of 1e-4. |