Text-To-4D Dynamic Scene Generation
Authors: Uriel Singer, Shelly Sheynin, Adam Polyak, Oron Ashual, Iurii Makarov, Filippos Kokkinos, Naman Goyal, Andrea Vedaldi, Devi Parikh, Justin Johnson, Yaniv Taigman
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the effectiveness of our approach using comprehensive quantitative and qualitative experiments and show an improvement over previously established internal baselines. We conduct a comprehensive set of experiments, including ablation studies, using both quantitative and qualitative metrics to reveal the technical decisions made during the development of our method. |
| Researcher Affiliation | Industry | Uriel Singer*, Shelly Sheynin*, Adam Polyak*, Oron Ashual, Iurii Makarov, Filippos Kokkinos, Naman Goyal, Andrea Vedaldi, Devi Parikh, Justin Johnson, Yaniv Taigman. *Equal contribution. Meta AI. Correspondence to: Uriel Singer <urielsinger@meta.com>, Shelly Sheynin <shellysheynin@meta.com>, Adam Polyak <adampolyak@meta.com>. |
| Pseudocode | No | The paper describes the methods in text and diagrams but does not provide structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper provides a link to view generated samples (make-a-video3d.github.io) but does not explicitly state that the source code for the methodology is open-source or provide a link to a code repository. |
| Open Datasets | Yes | We evaluated all baselines and ablations on the text prompts splits which were used in (Singer et al., 2022). |
| Dataset Splits | Yes | We evaluated all baselines and ablations on the text prompts splits which were used in (Singer et al., 2022). |
| Hardware Specification | Yes | All runtimes were measured on 8 NVIDIA A100 GPUs. |
| Software Dependencies | No | The paper mentions software like PyTorch (implicitly), CLIP, COLMAP, and Stable Diffusion, but does not provide specific version numbers for these or other ancillary software components. |
| Experiment Setup | Yes | Unless otherwise noted, we use a batch size of 8 and sample 128 points along each ray. ... The static scene representation is trained on rendered images of 64×64 for 2,000 iterations ... The dynamic stage is trained on rendered videos of 64×64×16 for 5,000 iterations ... Lastly, the super-resolution phase is trained on rendered videos of 256×256×16 for another 2,000 iterations ... We train the model using the Adam optimizer, with a cosine decay scheduler, starting from a learning rate of 1e-3. Here $\sigma$ is a function of the training step $t_s$. In order to anneal the bias for $M = 5000$ training steps from a minimum value $\sigma_{\min} = 0.2$ to a maximum value $\sigma_{\max} = 2.0$, we define a linear function as follows: $\sigma(t_s) = \min\!\left(\sigma_{\max},\; \sigma_{\min} + (\sigma_{\max} - \sigma_{\min}) \cdot t_s / M\right)$ |
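The annealing schedule quoted in the Experiment Setup row is simple enough to reconstruct directly from the formula. The snippet below is a minimal Python sketch of that linear ramp, using the constants quoted in the paper ($\sigma_{\min} = 0.2$, $\sigma_{\max} = 2.0$, $M = 5000$); the function name and signature are illustrative assumptions, since the authors have not released their code.

```python
def sigma_schedule(ts: int, sigma_min: float = 0.2, sigma_max: float = 2.0,
                   M: int = 5000) -> float:
    """Anneal the bias sigma linearly from sigma_min to sigma_max over the
    first M training steps, then hold it constant, as described in the
    paper's experiment setup. Name and signature are hypothetical."""
    return min(sigma_max, sigma_min + (sigma_max - sigma_min) * ts / M)

# Sigma rises linearly and saturates at sigma_max after M steps.
assert sigma_schedule(0) == 0.2
assert abs(sigma_schedule(2500) - 1.1) < 1e-9  # halfway through the ramp
assert sigma_schedule(5000) == 2.0
assert sigma_schedule(10000) == 2.0            # clamped after step M
```

In a training loop this would be evaluated once per step and the result used as the bias value for that iteration; the clamp via `min` is what makes the schedule flat after step $M$.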