Conditional Object-Centric Learning from Video
Authors: Thomas Kipf, Gamaleldin Fathy Elsayed, Aravindh Mahendran, Austin Stone, Sara Sabour, Georg Heigold, Rico Jonschkowski, Alexey Dosovitskiy, Klaus Greff
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We investigate 1) how SAVi compares to existing unsupervised video decomposition methods, 2) how various forms of hints (e.g., bounding boxes) can facilitate scene decomposition, and 3) how SAVi generalizes to unseen objects, backgrounds, and longer videos at test time. Metrics: We report two metrics to measure the quality of video decomposition, object segmentation, and tracking: Adjusted Rand Index (ARI) and mean Intersection over Union (mIoU). (A sketch of both metrics appears after this table.) |
| Researcher Affiliation | Industry | Thomas Kipf, Gamaleldin F. Elsayed, Aravindh Mahendran, Austin Stone, Sara Sabour, Georg Heigold, Rico Jonschkowski, Alexey Dosovitskiy & Klaus Greff (Google Research) |
| Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Project page: https://slot-attention-video.github.io/ |
| Open Datasets | Yes | The Kubric (Greff et al., 2021) dataset generation pipeline is publicly available under an Apache 2.0 license. MOVi++ contains approx. 380 publicly available CC-0 licensed HDR backgrounds from https://hdrihaven.com/. The data does not contain personally identifiable information or offensive content. The original CATER (Girdhar & Ramanan, 2019) dataset (without segmentation mask annotations) is publicly available under an Apache 2.0 license. |
| Dataset Splits | Yes | Each dataset contains 9000 training videos and 1000 validation videos with 24 frames at 12 fps each and, unless otherwise mentioned, a resolution of 64×64 pixels for MOVi and 128×128 pixels for MOVi++. |
| Hardware Specification | Yes | On 8x V100 GPUs with 16GB memory each, training SAVi with bounding box conditioning takes approx. 12 hrs for videos with 64×64 resolution and 30 hrs for videos with 128×128 resolution. We train our models on TPU v3 hardware. |
| Software Dependencies | No | We implement both SAVi and the T-VOS baseline in JAX (Bradbury et al., 2018) using the Flax (Heek et al., 2020) neural network library. We train our models on TPU v3 hardware. (Does not specify version numbers for JAX or Flax, only the year of their respective publications). |
| Experiment Setup | Yes | Training setup: During training, we split each video into consecutive sub-sequences of 6 frames each, where we provide the conditioning signal for the first frame. We train for 100k steps (200k for fully unsupervised video decomposition) with a batch size of 64 using Adam (Kingma & Ba, 2015) with a base learning rate of 2×10⁻⁴. We use a total of 11 slots in SAVi. For our experiments on fully unsupervised video decomposition we use 2 iterations of Slot Attention per frame and a single iteration otherwise. Other hyperparameters: As described in the main paper, we train for 100k steps with a batch size of 64 using the Adam optimizer (Kingma & Ba, 2015) with a learning rate of 2×10⁻⁴ and gradient clipping with a maximum norm of 0.05. Like in previous work (Locatello et al., 2020), we use learning rate warmup and learning rate decay. We linearly warm up the learning rate for 2.5k steps and we use cosine annealing (Loshchilov & Hutter, 2017) to decay the learning rate to 0 throughout the course of training. (A sketch of this schedule appears after this table.) |
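
For reference, below is a minimal sketch of the two reported metrics. It assumes standard definitions: scikit-learn's `adjusted_rand_score` for ARI (with an optional foreground-only variant), and a best-overlap matching for mIoU. The paper's exact matching protocol is not specified in the excerpts above, so treat the mIoU matching here as an assumption.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

def video_ari(true_seg, pred_seg, ignore_background=False):
    """ARI between two integer segmentations of a video.

    true_seg, pred_seg: int arrays of shape [T, H, W] holding per-pixel
    segment ids. With ignore_background=True, ground-truth background
    pixels (id 0) are excluded (the common FG-ARI variant; assumption).
    """
    t, p = true_seg.ravel(), pred_seg.ravel()
    if ignore_background:
        keep = t != 0
        t, p = t[keep], p[keep]
    return adjusted_rand_score(t, p)

def mean_iou(true_seg, pred_seg):
    """mIoU over ground-truth segments, matching each ground-truth
    segment to its best-overlapping predicted segment (assumption; the
    paper may instead fix the matching via the conditioning signal).
    """
    ious = []
    for s in np.unique(true_seg):
        gt = true_seg == s
        best = 0.0
        for q in np.unique(pred_seg):
            pr = pred_seg == q
            inter = np.logical_and(gt, pr).sum()
            union = np.logical_or(gt, pr).sum()
            if union > 0:
                best = max(best, inter / union)
        ious.append(best)
    return float(np.mean(ious))
```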
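
The quoted training setup also maps directly onto a JAX optimizer definition. The paper names JAX and Flax but not a specific optimizer library, so the use of `optax` below is an assumption; the numbers (base learning rate 2×10⁻⁴, 2.5k warmup steps, cosine decay to 0 over 100k steps, gradient clipping at a maximum norm of 0.05) are taken from the quoted setup.

```python
import optax

TOTAL_STEPS = 100_000   # 200k for the fully unsupervised setting
WARMUP_STEPS = 2_500
BASE_LR = 2e-4

# Linear warmup to the base learning rate, then cosine annealing to 0
# over the remainder of training, as described above.
schedule = optax.warmup_cosine_decay_schedule(
    init_value=0.0,
    peak_value=BASE_LR,
    warmup_steps=WARMUP_STEPS,
    decay_steps=TOTAL_STEPS,
    end_value=0.0,
)

# Gradient clipping to a maximum global norm of 0.05, followed by Adam.
optimizer = optax.chain(
    optax.clip_by_global_norm(0.05),
    optax.adam(learning_rate=schedule),
)
```

In a Flax training loop, `optimizer` would typically be passed as the `tx` argument to `flax.training.train_state.TrainState.create`, with batches of 64 six-frame sub-sequences as described above.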