Text-To-4D Dynamic Scene Generation

Authors: Uriel Singer, Shelly Sheynin, Adam Polyak, Oron Ashual, Iurii Makarov, Filippos Kokkinos, Naman Goyal, Andrea Vedaldi, Devi Parikh, Justin Johnson, Yaniv Taigman

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the effectiveness of our approach using comprehensive quantitative and qualitative experiments and show an improvement over previously established internal baselines. We conduct a comprehensive set of experiments, including ablation studies, using both quantitative and qualitative metrics to reveal the technical decisions made during the development of our method.
Researcher Affiliation | Industry | Uriel Singer*, Shelly Sheynin*, Adam Polyak*, Oron Ashual, Iurii Makarov, Filippos Kokkinos, Naman Goyal, Andrea Vedaldi, Devi Parikh, Justin Johnson, Yaniv Taigman. *Equal contribution. Meta AI. Correspondence to: Uriel Singer <urielsinger@meta.com>, Shelly Sheynin <shellysheynin@meta.com>, Adam Polyak <adampolyak@meta.com>.
Pseudocode | No | The paper describes the methods in text and diagrams but does not provide structured pseudocode or algorithm blocks.
Open Source Code | No | The paper provides a link to view generated samples (make-a-video3d.github.io) but does not explicitly state that the source code for the methodology is open-source or provide a link to a code repository.
Open Datasets | Yes | We evaluated all baselines and ablations on the text prompts splits which were used in (Singer et al., 2022).
Dataset Splits | Yes | We evaluated all baselines and ablations on the text prompts splits which were used in (Singer et al., 2022).
Hardware Specification | Yes | All runtimes were measured on 8 NVIDIA A100 GPUs.
Software Dependencies | No | The paper mentions software such as PyTorch (implicitly), CLIP, COLMAP, and Stable Diffusion, but does not provide specific version numbers for these or other ancillary software components.
Experiment Setup | Yes | Unless otherwise noted, we use a batch size of 8 and sample 128 points along each ray. ... The static scene representation is trained on rendered images of 64×64 for 2,000 iterations ... The dynamic stage is trained on rendered videos of 64×64×16 for 5,000 iterations ... Lastly, the super-resolution phase is trained on rendered videos of 256×256×16 for another 2,000 iterations ... We train the model using the Adam optimizer, with a cosine decay scheduler, starting from a learning rate of 1e-3. In order to anneal the bias σ, a function of the training step ts, over M = 5000 training steps from a minimum value σ_min = 0.2 to a maximum value σ_max = 2.0, we define a linear function as follows: σ(ts) = min(σ_max, σ_min + (σ_max − σ_min) · ts / M).
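
For concreteness, here is a minimal Python/PyTorch sketch of the training settings quoted in the Experiment Setup row: the linear annealing of the bias σ over the first M = 5000 steps, and an Adam optimizer with cosine decay starting at 1e-3. The placeholder model and the scheduler horizon (sum of the three stage lengths) are assumptions for illustration only; the paper's training code is not public.

```python
import torch

def sigma_schedule(ts: int, sigma_min: float = 0.2, sigma_max: float = 2.0, M: int = 5000) -> float:
    """Linearly anneal the bias sigma from sigma_min to sigma_max over the first M
    training steps, then hold it at sigma_max, following the quoted formula."""
    return min(sigma_max, sigma_min + (sigma_max - sigma_min) * ts / M)

# Hypothetical stand-in for the scene representation; the actual model is not released.
model = torch.nn.Linear(3, 4)

# Adam with a cosine decay schedule starting from lr = 1e-3, as reported.
# Using the sum of the three stage lengths (2,000 + 5,000 + 2,000 iterations)
# as the scheduler horizon is an assumption.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=2000 + 5000 + 2000)

# Quick check of the annealing values at a few steps.
for ts in (0, 2500, 5000, 7500):
    print(f"step {ts}: sigma = {sigma_schedule(ts):.2f}")
```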