Moving Off-the-Grid: Scene-Grounded Video Representations
Authors: Sjoerd van Steenkiste, Daniel Zoran, Yi Yang, Yulia Rubanova, Rishabh Kabra, Carl Doersch, Dilara Gokay, Joseph Heyward, Etienne Pot, Klaus Greff, Drew Hudson, Thomas Keck, Joao Carreira, Alexey Dosovitskiy, Mehdi S. M. Sajjadi, Thomas Kipf
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 4 Experiments: We begin by qualitatively investigating properties of the learned OTG representation. We trained MooG with 1024 512-dimensional OTG tokens on natural videos from the Ego4D dataset [21] and Kinetics700 dataset [6] using only the self-supervised prediction loss in (4). ... To make a quantitative assessment, we propose general readout decoders that support a variety of downstream tasks. We distinguish between two types of readouts: grid-based readouts (e.g. RGB or depth pixels) and tracking-based readouts (e.g. 2D points or object tracking). |
| Researcher Affiliation | Industry | Sjoerd van Steenkiste¹, Daniel Zoran², Yi Yang², Yulia Rubanova², Rishabh Kabra², Carl Doersch², Dilara Gokay², Joseph Heyward², Etienne Pot², Klaus Greff², Drew A. Hudson², Thomas Albert Keck², Joao Carreira², Alexey Dosovitskiy³, Mehdi S. M. Sajjadi², Thomas Kipf² — ¹Google Research, ²Google DeepMind, ³Inceptive |
| Pseudocode | No | The paper contains architectural diagrams and descriptions of algorithms but no structured pseudocode or algorithm blocks with specific labels like "Pseudocode" or "Algorithm X". |
| Open Source Code | No | The data we use is publicly accessible and we intend to release the main model code upon acceptance of the paper. |
| Open Datasets | Yes | We trained MooG with 1024 512-dimensional OTG tokens on natural videos from the Ego4D dataset [21] and Kinetics700 dataset [6]... We train MooG on Kubric MOVi-E [22]... The Waymo Open dataset [53] contains high-resolution videos... |
| Dataset Splits | Yes | The training set contains 97,500 videos and the validation sets 250 videos, each of length 24. |
| Hardware Specification | Yes | Our MooG runs make use of 64 TPUv3 [39] chips, each with 32 GiB of memory; each run takes about 48 hours for 1M steps. |
| Software Dependencies | No | We implemented MooG in JAX [3] using Flax [24]. The paper cites JAX and Flax but does not specify version numbers for these or any other libraries. |
| Experiment Setup | Yes | We train MooG on raw video data (see below for datasets used) for 1M steps using Adam with Nesterov momentum [16, 31] with a cosine decay schedule that includes a linear warm-up for 1000 steps, a peak value of 1e-4, and an end value of 1e-7. Updates are clipped using a maximum global norm of 1.0, and we use β1 = 0.9, β2 = 0.95 inside Adam. We use a batch size of 128 for most of our experiments, and a batch size of 256 for the comparison to domain-specific baselines in Tables 3 & 4. (A hedged sketch of this optimizer configuration follows the table.) |
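The quoted experiment setup maps onto a compact optimizer definition. Below is a minimal sketch using Optax: the paper names only JAX [3] and Flax [24], so the choice of Optax, as well as the warm-up start value of 0.0, are assumptions; all other hyperparameters come directly from the quoted setup.

```python
# Minimal sketch of the reported optimizer configuration, assuming Optax
# (not named in the paper; the init_value of 0.0 is also an assumption).
import optax

TOTAL_STEPS = 1_000_000  # "1M steps"

# Cosine decay with linear warm-up: peak 1e-4 after 1000 steps, end 1e-7.
schedule = optax.warmup_cosine_decay_schedule(
    init_value=0.0,          # warm-up start value (assumed)
    peak_value=1e-4,
    warmup_steps=1_000,
    decay_steps=TOTAL_STEPS,
    end_value=1e-7,
)

# Adam with Nesterov momentum, beta1 = 0.9, beta2 = 0.95; updates clipped
# to a maximum global norm of 1.0. The `nesterov` flag on optax.adam
# requires a recent Optax release.
optimizer = optax.chain(
    optax.clip_by_global_norm(1.0),
    optax.adam(learning_rate=schedule, b1=0.9, b2=0.95, nesterov=True),
)
```

The batch size (128 for most experiments, 256 for the baseline comparisons in Tables 3 & 4) lives in the data pipeline rather than the optimizer, so it is omitted here.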