Unsupervised Discovery of 3D Physical Objects from Video
Authors: Yilun Du, Kevin A. Smith, Tomer Ullman, Joshua B. Tenenbaum, Jiajun Wu
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We test how well POD-Net performs image segmentation and object discovery on two datasets: one made from ShapeNet objects (Chang et al., 2015), and one from real-world images. We find that POD-Net outperforms recent self-supervised image segmentation models that use regular foreground-background relationships (Greff et al., 2019) or assume that images are composable into object-like parts (Burgess et al., 2019). Finally, we show that the representations learned by POD-Net can be used to support reasoning in a task that requires identifying scenes with physically implausible events (Smith et al., 2019). Together, this demonstrates that using motion as a grouping cue to constrain the learning of object segmentations and representations achieves both goals: it produces better image segmentations and learns scene representations that are useful for physical reasoning. |
| Researcher Affiliation | Academia | Yilun Du (MIT), Kevin Smith (MIT), Tomer Ullman (Harvard University), Joshua Tenenbaum (MIT), Jiajun Wu (Stanford University) |
| Pseudocode | No | The paper does not include any clearly labeled pseudocode or algorithm blocks. Figure 2 presents a system diagram, not an algorithm. |
| Open Source Code | No | The paper provides a "Project page: https://yilundu.github.io/podnet". This is a general project page, not a direct link to a source-code repository for the methodology described in the paper, nor does it explicitly state that the code is hosted there. |
| Open Datasets | Yes | Data. To train models on moving ShapeNet objects, we use the generation code provided in the ADEPT dataset in Smith et al. (2019). We use the dataset in Lerer et al. (2016) with 492 videos of real block towers, which may or may not be falling. |
| Dataset Splits | No | The paper states: "We generate a training set of 1,000 videos, each 100 frames long..." and names the datasets used for evaluation, but it does not give explicit percentages or counts for training/validation/test splits. It describes a training set without detailing how the data was partitioned for validation or testing. |
| Hardware Specification | No | The paper does not explicitly specify any hardware used for running its experiments, such as specific GPU or CPU models, memory details, or cloud instance types. |
| Software Dependencies | No | The paper mentions: "We use the RMSprop optimizer with a learning rate of 10^-4 within the PyTorch framework (Paszke et al., 2019) to train our models." While PyTorch is mentioned, a specific version number is not provided, which is required for reproducibility. |
| Experiment Setup | Yes | We use the RMSprop optimizer with a learning rate of 10^-4 within the PyTorch framework (Paszke et al., 2019) to train our models. After qualitatively observing object-like masks (roughly after 100,000 iterations), we switch to maximizing the likelihood of the model under both the generation and physical plausibility objectives. Videos have a resolution of 1024×1024 pixels. We apply our model with a patch size of 256×256. We use a residual architecture (He et al., 2015) for the attention and VAE components. We train a recurrent model with a total of 5 slots for each image. (A minimal configuration sketch based on these details follows the table.) |
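
Based on the experiment-setup details quoted above, the reported optimizer and basic configuration could be reproduced roughly as follows. This is a minimal sketch, not the authors' released code: the class name `PODNetSketch` and its placeholder encoder layer are hypothetical, since the paper does not provide an implementation of the residual attention and VAE components.

```python
import torch
import torch.nn as nn


class PODNetSketch(nn.Module):
    """Hypothetical stand-in for POD-Net. The paper only states that the
    attention and VAE components use residual architectures (He et al., 2015)
    and that a recurrent model with 5 slots is trained per image."""

    def __init__(self, num_slots: int = 5, patch_size: int = 256):
        super().__init__()
        self.num_slots = num_slots      # 5 slots per image, as reported
        self.patch_size = patch_size    # 256x256 patches from 1024x1024 frames
        # Placeholder layer so the optimizer has parameters to update;
        # the actual sub-networks are not specified in the paper.
        self.encoder = nn.Conv2d(3, 64, kernel_size=3, padding=1)

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        return self.encoder(patches)


model = PODNetSketch(num_slots=5, patch_size=256)

# RMSprop with a learning rate of 10^-4, as stated in the paper.
optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-4)

# Reported schedule: after object-like masks emerge (roughly 100,000
# iterations), the physical-plausibility objective is added to the
# generation objective.
SWITCH_ITERATION = 100_000
```

Because the paper omits the loss definitions and network details, this sketch only pins down the hyperparameters it does report (optimizer, learning rate, slot count, patch size, and the 100,000-iteration objective switch).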