Conditional Object-Centric Learning from Video
Authors: Thomas Kipf, Gamaleldin Fathy Elsayed, Aravindh Mahendran, Austin Stone, Sara Sabour, Georg Heigold, Rico Jonschkowski, Alexey Dosovitskiy, Klaus Greff
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We investigate 1) how SAVi compares to existing unsupervised video decomposition methods, 2) how various forms of hints (e.g., bounding boxes) can facilitate scene decomposition, and 3) how SAVi generalizes to unseen objects, backgrounds, and longer videos at test time. Metrics: We report two metrics to measure the quality of video decomposition, object segmentation, and tracking: Adjusted Rand Index (ARI) and mean Intersection over Union (mIoU). (A sketch of both metrics appears after this table.) |
| Researcher Affiliation | Industry | Thomas Kipf, Gamaleldin F. Elsayed, Aravindh Mahendran, Austin Stone, Sara Sabour, Georg Heigold, Rico Jonschkowski, Alexey Dosovitskiy & Klaus Greff (Google Research) |
| Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Project page: https://slot-attention-video.github.io/ |
| Open Datasets | Yes | The Kubric (Greff et al., 2021) dataset generation pipeline is publicly available under an Apache 2.0 license. MOVi++ contains approx. 380 publicly available CC-0 licensed HDR backgrounds from https://hdrihaven.com/. The data does not contain personally identifiable information or offensive content. The original CATER (Girdhar & Ramanan, 2019) dataset (without segmentation mask annotations) is publicly available under an Apache 2.0 license. |
| Dataset Splits | Yes | Each dataset contains 9000 training videos and 1000 validation videos with 24 frames at 12 fps each and, unless otherwise mentioned, a resolution of 64×64 pixels for MOVi and 128×128 pixels for MOVi++. |
| Hardware Specification | Yes | On 8x V100 GPUs with 16GB memory each, training SAVi with bounding box conditioning takes approx. 12 hrs for videos with 64×64 resolution and 30 hrs for videos with 128×128 resolution. We train our models on TPU v3 hardware. |
| Software Dependencies | No | We implement both SAVi and the T-VOS baseline in JAX (Bradbury et al., 2018) using the Flax (Heek et al., 2020) neural network library. We train our models on TPU v3 hardware. (Does not specify version numbers for JAX or Flax, only the year of their respective publications). |
| Experiment Setup | Yes | Training setup: During training, we split each video into consecutive sub-sequences of 6 frames each, where we provide the conditioning signal for the first frame. We train for 100k steps (200k for fully unsupervised video decomposition) with a batch size of 64 using Adam (Kingma & Ba, 2015) with a base learning rate of 2×10⁻⁴. We use a total of 11 slots in SAVi. For our experiments on fully unsupervised video decomposition we use 2 iterations of Slot Attention per frame and a single iteration otherwise. Other hyperparameters: As described in the main paper, we train for 100k steps with a batch size of 64 using the Adam optimizer (Kingma & Ba, 2015) with a learning rate of 2×10⁻⁴ and gradient clipping with a maximum norm of 0.05. Like in previous work (Locatello et al., 2020), we use learning rate warmup and learning rate decay. We linearly warm up the learning rate for 2.5k steps and we use cosine annealing (Loshchilov & Hutter, 2017) to decay the learning rate to 0 throughout the course of training. (A sketch of this schedule appears after this table.) |
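
For reference, below is a minimal sketch of the two reported metrics. It assumes standard definitions: scikit-learn's `adjusted_rand_score` for ARI (with an optional foreground-only variant), and a best-overlap matching for mIoU. The paper's exact matching protocol is not specified in the excerpts above, so treat the mIoU matching here as an assumption.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

def video_ari(true_seg, pred_seg, ignore_background=False):
    """ARI between two integer segmentations of a video.

    true_seg, pred_seg: int arrays of shape [T, H, W] holding per-pixel
    segment ids. With ignore_background=True, ground-truth background
    pixels (id 0) are excluded (the common FG-ARI variant; assumption).
    """
    t, p = true_seg.ravel(), pred_seg.ravel()
    if ignore_background:
        keep = t != 0
        t, p = t[keep], p[keep]
    return adjusted_rand_score(t, p)

def mean_iou(true_seg, pred_seg):
    """mIoU over ground-truth segments, matching each ground-truth
    segment to its best-overlapping predicted segment (assumption; the
    paper may instead fix the matching via the conditioning signal).
    """
    ious = []
    for s in np.unique(true_seg):
        gt = true_seg == s
        best = 0.0
        for q in np.unique(pred_seg):
            pr = pred_seg == q
            inter = np.logical_and(gt, pr).sum()
            union = np.logical_or(gt, pr).sum()
            if union > 0:
                best = max(best, inter / union)
        ious.append(best)
    return float(np.mean(ious))
```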
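
The quoted training setup also maps directly onto a JAX optimizer definition. The paper names JAX and Flax but not a specific optimizer library, so the use of `optax` below is an assumption; the numbers (base learning rate 2×10⁻⁴, 2.5k warmup steps, cosine decay to 0 over 100k steps, gradient clipping at a maximum norm of 0.05) are taken from the quoted setup.

```python
import optax

TOTAL_STEPS = 100_000   # 200k for the fully unsupervised setting
WARMUP_STEPS = 2_500
BASE_LR = 2e-4

# Linear warmup to the base learning rate, then cosine annealing to 0
# over the remainder of training, as described above.
schedule = optax.warmup_cosine_decay_schedule(
    init_value=0.0,
    peak_value=BASE_LR,
    warmup_steps=WARMUP_STEPS,
    decay_steps=TOTAL_STEPS,
    end_value=0.0,
)

# Gradient clipping to a maximum global norm of 0.05, followed by Adam.
optimizer = optax.chain(
    optax.clip_by_global_norm(0.05),
    optax.adam(learning_rate=schedule),
)
```

In a Flax training loop, `optimizer` would typically be passed as the `tx` argument to `flax.training.train_state.TrainState.create`, with batches of 64 six-frame sub-sequences as described above.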