OBJECT DYNAMICS DISTILLATION FOR SCENE DECOMPOSITION AND REPRESENTATION
Authors: Qu Tang, Xiangyu Zhu, Zhen Lei, Zhaoxiang Zhang
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate ODDN on different downstream tasks as shown in Figure 1. Results show that representations with object dynamics perform better on reasoning tasks than representations with static properties only, and the relation module endows ODDN with the capability of predicting the future. In addition, we notice that incorporating object dynamics and relations into the basic scene decomposition framework improves segmentation and reconstruction quality. In this section, we design experiments from the perspectives of representation, prediction, and scene decomposition. In particular, we study how ODDN performs on video understanding and reasoning, video prediction, reconstruction, and segmentation. |
| Researcher Affiliation | Academia | (1) School of Artificial Intelligence, University of Chinese Academy of Sciences; (2) Institute of Automation, Chinese Academy of Sciences; (3) Centre for Artificial Intelligence and Robotics, Hong Kong Institute of Science & Innovation, Chinese Academy of Sciences. {tangqu2020,zhaoxiang.zhang}@ia.ac.cn, {xiangyu.zhu,zlei}@nlpr.ia.ac.cn |
| Pseudocode | Yes | Appendix A.1 (Algorithm): "Here we detail our algorithm with pseudocode: Algorithm 1: ODDN Pseudocode." The algorithm body is not reproduced here; a hedged sketch of the kind of refinement loop it describes follows this table. |
| Open Source Code | Yes | Code is available at https://github.com/tqace/ODDN. |
| Open Datasets | Yes | CLEVRER is a synthetic video dataset of moving and colliding objects. Each video contains 128 frames at a resolution of 480×320. We pre-train ODDN on the entire CLEVRER training set. To promote convergence of our Relation Module, we extract images every 4 frames and ensure that at least one collision event is included, forming CLEVRER-collision, on which we fine-tune ODDN. |
| Dataset Splits | No | We pre-train ODDN on the entire CLEVRER training set. To promote convergence of our Relation Module, we extract images every 4 frames and ensure that at least one collision event is included, forming CLEVRER-collision, on which we fine-tune ODDN. For testing, we use the validation set, which has ground-truth masks; we sample 1k sub-clips containing 6 objects, each consisting of 10 frames. |
| Hardware Specification | Yes | We train our models on 8 GeForce RTX 3090 GPUs, which takes approximately two days per model. |
| Software Dependencies | No | We use Adam for all experiments, with a learning rate of 0.0003 and default values for all remaining parameters. (The optimizer is named, but no library or framework versions are given.) |
| Experiment Setup | Yes | We initialize the parameters of the posterior by sampling from U(−0.5, 0.5). In experiments on prediction tasks, we use a latent dimensionality of 64 and downscale the image to 64×64 after a center-crop preprocess, as in IODINE and PROVIDE, such that dim(λ) = 128. In experiments on the video reasoning task, we use a latent dimensionality of 16, as in ALOE, which makes dim(λ) = 32, and downscale the image to 64×96 without cropping. The variance of the likelihood is set to σ = 0.3 in all experiments. We keep the default number of iterative refinements at R = 5, and use K = 7 slots for both training and testing. Furthermore, we set β = 100.0 for all experiments. We use Adam for all experiments, with a learning rate of 0.0003 and default values for all remaining parameters. During training, we gradually increase the number of frames per video, as we have found this to make the optimisation more stable. We train models with sequences of length 8 and a batch size of 12. (A configuration sketch follows this table.) |
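
To make the quoted setup concrete, here is a minimal sketch of the fixed-variance spatial-mixture likelihood (σ = 0.3, K = 7 slots, 64×64 frames) and the Adam configuration (learning rate 0.0003, defaults elsewhere), both taken from the rows above. The tensor shapes, the helper `mixture_nll`, and the placeholder `model` are illustrative assumptions, not the authors' released implementation (see https://github.com/tqace/ODDN for that).

```python
import math

import torch
import torch.nn.functional as F

def mixture_nll(image, rgb, mask_logits, sigma=0.3):
    """K-component spatial Gaussian mixture NLL with fixed sigma = 0.3.

    image:       (B, 3, H, W)    target frame (64x64 after center-crop)
    rgb:         (B, K, 3, H, W) per-slot RGB means from the decoder, K = 7
    mask_logits: (B, K, 1, H, W) per-slot mask logits
    """
    log_m = F.log_softmax(mask_logits, dim=1)          # mixing weights across slots
    log_p = (-0.5 * ((image.unsqueeze(1) - rgb) ** 2) / sigma ** 2
             - 0.5 * math.log(2 * math.pi * sigma ** 2))
    log_p = log_p.sum(dim=2, keepdim=True)             # sum log-density over RGB channels
    # log sum_k m_k * p_k per pixel, summed over pixels, averaged over the batch
    return -torch.logsumexp(log_m + log_p, dim=1).sum(dim=(1, 2, 3)).mean()

model = torch.nn.Linear(1, 1)  # placeholder for the full ODDN network
# Adam exactly as quoted: lr = 0.0003, "default values for all remaining parameters".
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
```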
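The report quotes the existence of Algorithm 1 but not its body. As a hedged illustration only, the sketch below shows the kind of IODINE-style iterative refinement loop the hyperparameters above imply (R = 5 refinements per frame, K = 7 slots, reparameterised latents), with a relation step between frames in the spirit of the paper's claim that the relation module enables future prediction. `decoder`, `refine`, and `relation` are hypothetical callables, not the authors' Algorithm 1.

```python
import torch

def refinement_rollout(video, lmbda0, decoder, refine, relation, R=5):
    """Hedged sketch: iterative posterior refinement per frame, then a
    relation step between frames. Not the authors' Algorithm 1.

    video:  iterable of (B, 3, H, W) frames (training uses length-8 clips)
    lmbda0: (B, K, 2*D) initial posterior parameters, requires_grad=True
    """
    lmbda, losses = lmbda0, []
    for frame in video:
        for _ in range(R):                                        # R = 5 refinements
            mu, logvar = lmbda.chunk(2, dim=-1)
            z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterise
            rgb, mask_logits = decoder(z)                         # per-slot RGB + masks
            loss = mixture_nll(frame, rgb, mask_logits)           # from the sketch above
            grad = torch.autograd.grad(loss, lmbda, create_graph=True)[0]
            lmbda = lmbda + refine(lmbda, grad)                   # amortised update
            losses.append(loss)
        lmbda = relation(lmbda)  # hypothetical relation step: carry dynamics forward
    return lmbda, losses
```

The gradient-fed `refine` update is the signature move of IODINE-style amortised inference; where exactly ODDN distils object dynamics into this loop is specified by the paper's Algorithm 1, which this sketch does not reproduce.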