Moving Off-the-Grid: Scene-Grounded Video Representations
Authors: Sjoerd van Steenkiste, Daniel Zoran, Yi Yang, Yulia Rubanova, Rishabh Kabra, Carl Doersch, Dilara Gokay, Joseph Heyward, Etienne Pot, Klaus Greff, Drew Hudson, Thomas Keck, Joao Carreira, Alexey Dosovitskiy, Mehdi S. M. Sajjadi, Thomas Kipf
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 4 Experiments: We begin by qualitatively investigating properties of the learned OTG representation. We trained MooG with 1024 512-dimensional OTG tokens on natural videos from the Ego4D dataset [21] and Kinetics700 dataset [6] using only the self-supervised prediction loss in (4). ... To make a quantitative assessment, we propose general readout decoders that support a variety of downstream tasks. We distinguish between two types of readouts: grid-based readouts (e.g. RGB or depth pixels) and tracking-based readouts (e.g. 2D points or object tracking). |
| Researcher Affiliation | Industry | Sjoerd van Steenkiste¹, Daniel Zoran², Yi Yang², Yulia Rubanova², Rishabh Kabra², Carl Doersch², Dilara Gokay², Joseph Heyward², Etienne Pot², Klaus Greff², Drew A. Hudson², Thomas Albert Keck², Joao Carreira², Alexey Dosovitskiy³, Mehdi S. M. Sajjadi², Thomas Kipf² — ¹Google Research, ²Google DeepMind, ³Inceptive |
| Pseudocode | No | The paper contains architectural diagrams and descriptions of algorithms but no structured pseudocode or algorithm blocks with specific labels like "Pseudocode" or "Algorithm X". |
| Open Source Code | No | The data we use is publicly accessible and we intend to release the main model code upon acceptance of the paper. |
| Open Datasets | Yes | We trained MooG with 1024 512-dimensional OTG tokens on natural videos from the Ego4D dataset [21] and Kinetics700 dataset [6]... We train MooG on Kubric MOVi-E [22]... The Waymo Open dataset [53] contains high-resolution videos... |
| Dataset Splits | Yes | The training set contains 97,500 videos and the validation sets 250 videos, each of length 24. |
| Hardware Specification | Yes | Our MooG runs make use of 64 TPUv3 [39] chips, each with 32 GiB of memory; each run takes about 48 hours for 1M steps. |
| Software Dependencies | No | We implemented MooG in JAX [3] using Flax [24]. The paper cites JAX and Flax but does not specify version numbers for these or any other libraries. |
| Experiment Setup | Yes | We train MooG on raw video data (see below for datasets used) for 1M steps using Adam with Nesterov momentum [16, 31] with a cosine decay schedule that includes a linear warm-up for 1000 steps, a peak value of 1e-4, and an end value of 1e-7. Updates are clipped using a maximum global norm of 1.0, and we use β1 = 0.9, β2 = 0.95 inside Adam. We use a batch size of 128 for most of our experiments, and a batch size of 256 for the comparison to domain-specific baselines in Tables 3 & 4. (A hedged sketch of this optimizer configuration follows the table.) |
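The quoted experiment setup maps onto a compact optimizer definition. Below is a minimal sketch using Optax: the paper names only JAX [3] and Flax [24], so the choice of Optax, as well as the warm-up start value of 0.0, are assumptions; all other hyperparameters come directly from the quoted setup.

```python
# Minimal sketch of the reported optimizer configuration, assuming Optax
# (not named in the paper; the init_value of 0.0 is also an assumption).
import optax

TOTAL_STEPS = 1_000_000  # "1M steps"

# Cosine decay with linear warm-up: peak 1e-4 after 1000 steps, end 1e-7.
schedule = optax.warmup_cosine_decay_schedule(
    init_value=0.0,          # warm-up start value (assumed)
    peak_value=1e-4,
    warmup_steps=1_000,
    decay_steps=TOTAL_STEPS,
    end_value=1e-7,
)

# Adam with Nesterov momentum, beta1 = 0.9, beta2 = 0.95; updates clipped
# to a maximum global norm of 1.0. The `nesterov` flag on optax.adam
# requires a recent Optax release.
optimizer = optax.chain(
    optax.clip_by_global_norm(1.0),
    optax.adam(learning_rate=schedule, b1=0.9, b2=0.95, nesterov=True),
)
```

The batch size (128 for most experiments, 256 for the baseline comparisons in Tables 3 & 4) lives in the data pipeline rather than the optimizer, so it is omitted here.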