SCALOR: Generative World Models with Scalable Object Representations
Authors: Jindong Jiang*, Sepehr Janghorbani*, Gerard de Melo, Sungjin Ahn
ICLR 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we describe the experiments conducted to empirically evaluate the performance of SCALOR. We propose two tasks: (i) synthetic MNIST/dSprites shapes and (ii) natural-scene CCTV footage of walking pedestrians. We will show SCALOR's abilities to detect and track objects, to generate future trajectories, and to generalize to unseen settings. Furthermore, we provide a quantitative comparison to state-of-the-art baselines. |
| Researcher Affiliation | Academia | Jindong Jiang, Sepehr Janghorbani, Gerard de Melo & Sungjin Ahn (Rutgers University) |
| Pseudocode | Yes | Appendix A (Algorithms): Algorithm 1: Discovery Proposal-Rejection Inference; Algorithm 2: Propagation Inference; Algorithm 3: Background Module and Rendering |
| Open Source Code | No | Full details of the architecture will be released along with our code. |
| Open Datasets | Yes | We first evaluate our model on datasets of moving dSprites shapes as well as moving MNIST digits. ... Specifically, we consider the Crowded Grand Central Station dataset (Zhou et al., 2012)... |
| Dataset Splits | No | For the natural-scene experiments, we spatially split the video into 8 parts and create a dataset of 400k frames in total. We choose the first 360k frames for training and the remaining 40k frames for testing. No explicit mention of a validation split was found. (A minimal split sketch follows the table.) |
| Hardware Specification | No | No specific hardware details (like GPU/CPU models, memory) were provided for the experimental setup. |
| Software Dependencies | No | No specific software dependencies with version numbers (e.g., library or solver names with versions) were provided. |
| Experiment Setup | Yes | We choose a batch size of 20 for the natural-scene experiments and a batch size of 16 for the MNIST/dSprites experiments. The learning rate is fixed at 4e-5 for natural-image experiments and 5e-4 for dSprites/MNIST experiments. We use RMSprop for optimization during training. The standard deviation of the image distribution is set to 0.1 for natural experiments and 0.2 for toy experiments. The prior for all Gaussian posteriors is the standard normal. For the pedestrian tracking dataset, we constrain the range of z^scale so that the inferred width can vary from 5.2 to 11.7 pixels and the height from 12.0 to 28.8 pixels, both with a prior at the middle of the range during discovery. Similarly, we constrain z^scale on the synthetic datasets to vary from 0.5 to 1.5 times the actual object size. The z^pos variable in the propagation phase is modeled as the deviation from the previous time step's position rather than as a global coordinate. The prior for z^pres in discovery is set to 0.1 at the beginning of training and quickly annealed to 1e-3 for natural-image experiments and 1e-4 for dSprites/MNIST experiments. The temperature used for modeling z^pres starts at 1.0 and is annealed to 0.3 after 20k iterations. (A hedged configuration sketch of these values follows the table.) |
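To make the Dataset Splits row concrete, here is a minimal Python sketch of the reported split. The frame counts come from the quote above; the function name `split_frames` and the sequence-slicing interface are hypothetical illustration, not the authors' code.

```python
# Hypothetical sketch of the train/test split reported above.
# The counts (8 spatial parts, 400k frames, first 360k for training)
# come from the paper; everything else is an assumption.
NUM_SPATIAL_PARTS = 8
TOTAL_FRAMES = 400_000
TRAIN_FRAMES = 360_000  # remaining 40k frames are held out for testing

def split_frames(frames):
    """Split an ordered frame sequence into train and test sets by index."""
    assert len(frames) == TOTAL_FRAMES
    return frames[:TRAIN_FRAMES], frames[TRAIN_FRAMES:]
```

Note that, as the row observes, the paper reports no validation split, so this sketch holds out frames for testing only.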
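The Experiment Setup row can likewise be read as a training configuration. The sketch below collects the reported values in one place; the dictionary layout, the linear annealing helper (the paper says only that the z^pres prior is "quickly" annealed), and the placeholder model are assumptions rather than the authors' implementation.

```python
import torch

# Reported hyperparameters per experiment family; the numeric values are
# from the paper, the structure is an assumption for illustration.
HPARAMS = {
    "natural": dict(batch_size=20, lr=4e-5, image_std=0.1, z_pres_prior_final=1e-3),
    "toy":     dict(batch_size=16, lr=5e-4, image_std=0.2, z_pres_prior_final=1e-4),
}

def anneal(start, end, step, total=20_000):
    """Linearly anneal from `start` to `end` over `total` steps (schedule shape assumed)."""
    t = min(step / total, 1.0)
    return start + t * (end - start)

cfg = HPARAMS["natural"]
model = torch.nn.Linear(8, 8)  # placeholder standing in for the SCALOR model
optimizer = torch.optim.RMSprop(model.parameters(), lr=cfg["lr"])  # RMSprop per the paper

for step in range(20_000):
    tau = anneal(1.0, 0.3, step)                                 # temperature for z^pres, 1.0 -> 0.3
    z_pres_prior = anneal(0.1, cfg["z_pres_prior_final"], step)  # annealed z^pres prior
    # ... forward pass and ELBO with image-likelihood std cfg["image_std"],
    # standard-normal priors on the Gaussian latents, then optimizer.step()
```

A design note: expressing the z^pres prior and the temperature as functions of the global step, as above, makes the annealing schedules easy to log and reproduce.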