DyST: Towards Dynamic Neural Scene Representations on Real-World Videos
Authors: Maximilian Seitzer, Sjoerd van Steenkiste, Thomas Kipf, Klaus Greff, Mehdi S. M. Sajjadi
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | From Sec. 4 (Experiments): Unless specified otherwise, DyST is always co-trained on our synthetic DySO dataset and on real-world videos as described in Secs. 3.1 and 3.2. As a source of real-world videos, we use the Something-Something v2 dataset (SSv2) (Goyal et al., 2017), which consists of roughly 170K training videos of humans performing basic actions with everyday objects. The videos contain both nontrivial camera motion and dynamic scene manipulation. From each video, we sub-sample a clip of 64 frames. The model is trained using 3 randomly sampled input frames X to compute the scene representation Z = Enc(X) and 4 randomly sampled target views for reconstruction. Following Sajjadi et al. (2022b), we train using a batch size of 256 for 4M steps. For both control latents, we use a dimensionality of Nc = Nd = 8. We refer to Appendix A.1 for further implementation details. |
| Researcher Affiliation | Collaboration | Maximilian Seitzer (MPI for Intelligent Systems), Sjoerd van Steenkiste (Google Research), Thomas Kipf (Google DeepMind), Klaus Greff (Google DeepMind), Mehdi S. M. Sajjadi (Google DeepMind) |
| Pseudocode | No | The paper describes its methods in prose and does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper mentions a project website 'dyst-paper.github.io' but does not explicitly state that the source code for the methodology is available there or provide a direct link to a code repository. It only promises the dataset will be made public. |
| Open Datasets | Yes | As a source of real-world videos, we use the Something-Something v2 dataset (SSv2) (Goyal et al., 2017), which consists of roughly 170K training videos of humans performing basic actions with everyday objects. |
| Dataset Splits | Yes | The DySO dataset consists of 1M scenes for training, 100K scenes for validation, and 10K scenes for testing. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments, such as GPU/CPU models or other hardware specifications. |
| Software Dependencies | No | The paper mentions using the Kubric simulator but does not provide specific version numbers for any software dependencies or libraries. |
| Experiment Setup | Yes | The model is trained using 3 randomly sampled input frames X to compute the scene representation Z = Enc(X) and 4 randomly sampled target views for reconstruction. Following Sajjadi et al. (2022b), we train using a batch size of 256 for 4M steps. For both control latents, we use a dimensionality of Nc = Nd = 8. The model is trained on images of size 128 × 128. During training, we always use 3 input views and 4 target views per scene, where we render 8192 pixels uniformly across the target views. We alternate gradient steps between batches from the DySO and the SSv2 dataset, and train the model end-to-end using the Adam optimizer (with β₁ = 0.9, β₂ = 0.999, ϵ = 10⁻⁸) for 4M steps. The learning rate is decayed from initially 1 × 10⁻⁴ to 1.6 × 10⁻⁵, with an initial linear warmup in the first 2500 steps. We also clip gradients exceeding a norm of 0.1. As in RUST, we scale the gradients flowing to the camera & dynamics estimator by a factor of 0.2. (Hedged sketches of this configuration follow the table.) |
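For concreteness, the snippet below sketches the per-scene view and pixel sampling quoted in the Experiment Setup row: 3 input views, 4 target views, and 8192 target pixels per scene on 128 × 128 images. The paper does not release training code, so the helper `sample_views_and_pixels`, the JAX framing, and the assumption that input and target frames are drawn without replacement from the same 64-frame clip are illustrative choices, not the authors' implementation.

```python
import jax
import jax.numpy as jnp

# Constants taken from the quoted Experiment Setup excerpt.
NUM_FRAMES = 64          # frames per sub-sampled clip
NUM_INPUT = 3            # input views used to compute Z = Enc(X)
NUM_TARGET = 4           # target views rendered for reconstruction
PIXELS_PER_SCENE = 8192  # target pixels sampled uniformly across target views
IMAGE_SIZE = 128         # images are 128 x 128

def sample_views_and_pixels(key):
    """Sample input/target frame indices and target pixel coordinates.

    Assumes input and target frames are drawn without replacement from the
    same clip; the excerpt does not specify whether the two sets may overlap.
    """
    frame_key, pixel_key = jax.random.split(key)
    frames = jax.random.permutation(frame_key, NUM_FRAMES)
    input_idx = frames[:NUM_INPUT]
    target_idx = frames[NUM_INPUT:NUM_INPUT + NUM_TARGET]
    # Uniform (target-view, row, column) coordinates for the rendered pixels.
    pixels = jax.random.randint(
        pixel_key,
        shape=(PIXELS_PER_SCENE, 3),
        minval=jnp.array([0, 0, 0]),
        maxval=jnp.array([NUM_TARGET, IMAGE_SIZE, IMAGE_SIZE]),
    )
    return input_idx, target_idx, pixels

input_idx, target_idx, pixels = sample_views_and_pixels(jax.random.PRNGKey(0))
```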
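The optimizer settings in the same row map naturally onto an `optax` configuration. The sketch below encodes the quoted Adam hyper-parameters, the 2500-step linear warmup to 1 × 10⁻⁴, the decay to 1.6 × 10⁻⁵ over the 4M training steps, and gradient clipping at norm 0.1. The cosine shape of the decay and the use of optax are assumptions (the excerpt only says the rate is "decayed"), and the 0.2 gradient scaling on the camera & dynamics estimator is model-internal and not shown.

```python
import optax

TOTAL_STEPS = 4_000_000

# Linear warmup to the peak rate over 2500 steps, then (assumed) cosine
# decay down to 1.6e-5 by the end of training.
learning_rate = optax.warmup_cosine_decay_schedule(
    init_value=0.0,
    peak_value=1e-4,
    warmup_steps=2_500,
    decay_steps=TOTAL_STEPS,
    end_value=1.6e-5,
)

optimizer = optax.chain(
    optax.clip_by_global_norm(0.1),  # clip gradients exceeding a norm of 0.1
    optax.adam(learning_rate, b1=0.9, b2=0.999, eps=1e-8),
)
```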