OBJECT DYNAMICS DISTILLATION FOR SCENE DECOMPOSITION AND REPRESENTATION
Authors: Qu Tang, Xiangyu Zhu, Zhen Lei, Zhaoxiang Zhang
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate ODDN on different downstream tasks as shown in Figure 1. Results show that representations with object dynamics perform better on reasoning tasks than representations with static properties only, and the relation module endows ODDN with the capability of predicting the future. In addition, we notice that incorporating object dynamics and relations into the basic scene decomposition framework improves segmentation and reconstruction quality. In this section, we design experiments from the perspectives of representation, prediction, and scene decomposition. In particular, we study how ODDN performs on video understanding and reasoning, video prediction, reconstruction, and segmentation. |
| Researcher Affiliation | Academia | (1) School of Artificial Intelligence, University of Chinese Academy of Sciences; (2) Institute of Automation, Chinese Academy of Sciences; (3) Centre for Artificial Intelligence and Robotics, Hong Kong Institute of Science & Innovation, Chinese Academy of Sciences. {tangqu2020,zhaoxiang.zhang}@ia.ac.cn, {xiangyu.zhu,zlei}@nlpr.ia.ac.cn |
| Pseudocode | Yes | Appendix A.1 (Algorithm): "Here we detail our algorithm with pseudocode: Algorithm 1: ODDN Pseudocode." The algorithm body is not reproduced here; a hedged sketch of the kind of refinement loop it describes follows this table. |
| Open Source Code | Yes | Code is available at https://github.com/tqace/ODDN. |
| Open Datasets | Yes | CLEVRER is a synthetic video dataset of moving and colliding objects. Each video contains 128 frames at a resolution of 480×320. We pre-train ODDN on the entire CLEVRER training set. To promote convergence of our Relation Module, we extract images every 4 frames and ensure that at least one collision event is included, forming CLEVRER-collision, on which we fine-tune ODDN. |
| Dataset Splits | No | We pre-train ODDN on the entire CLEVRER training set. To promote convergence of our Relation Module, we extract images every 4 frames and ensure that at least one collision event is included, forming CLEVRER-collision, on which we fine-tune ODDN. For testing, we use the validation set, which has ground-truth masks; we sample 1k sub-clips containing 6 objects, each consisting of 10 frames. |
| Hardware Specification | Yes | We train our models on 8 GeForce RTX 3090 GPUs, which takes approximately two days per model. |
| Software Dependencies | No | We use Adam for all experiments, with a learning rate of 0.0003 and default values for all remaining parameters. (The optimizer is named, but no library or framework versions are given.) |
| Experiment Setup | Yes | We initialize the parameters of the posterior by sampling from U(−0.5, 0.5). In experiments on prediction tasks, we use a latent dimensionality of 64 and downscale the image to 64×64 after a center-crop preprocess, as in IODINE and PROVIDE, such that dim(λ) = 128. In experiments on the video reasoning task, we use a latent dimensionality of 16, as in ALOE, which makes dim(λ) = 32, and downscale the image to 64×96 without cropping. The variance of the likelihood is set to σ = 0.3 in all experiments. We keep the default number of iterative refinements at R = 5, and use K = 7 slots for both training and testing. Furthermore, we set β = 100.0 for all experiments. We use Adam for all experiments, with a learning rate of 0.0003 and default values for all remaining parameters. During training, we gradually increase the number of frames per video, as we have found this to make the optimisation more stable. We train models with sequences of length 8 and a batch size of 12. (A configuration sketch follows this table.) |
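
To make the quoted setup concrete, here is a minimal sketch of the fixed-variance spatial-mixture likelihood (σ = 0.3, K = 7 slots, 64×64 frames) and the Adam configuration (learning rate 0.0003, defaults elsewhere), both taken from the rows above. The tensor shapes, the helper `mixture_nll`, and the placeholder `model` are illustrative assumptions, not the authors' released implementation (see https://github.com/tqace/ODDN for that).

```python
import math

import torch
import torch.nn.functional as F

def mixture_nll(image, rgb, mask_logits, sigma=0.3):
    """K-component spatial Gaussian mixture NLL with fixed sigma = 0.3.

    image:       (B, 3, H, W)    target frame (64x64 after center-crop)
    rgb:         (B, K, 3, H, W) per-slot RGB means from the decoder, K = 7
    mask_logits: (B, K, 1, H, W) per-slot mask logits
    """
    log_m = F.log_softmax(mask_logits, dim=1)          # mixing weights across slots
    log_p = (-0.5 * ((image.unsqueeze(1) - rgb) ** 2) / sigma ** 2
             - 0.5 * math.log(2 * math.pi * sigma ** 2))
    log_p = log_p.sum(dim=2, keepdim=True)             # sum log-density over RGB channels
    # log sum_k m_k * p_k per pixel, summed over pixels, averaged over the batch
    return -torch.logsumexp(log_m + log_p, dim=1).sum(dim=(1, 2, 3)).mean()

model = torch.nn.Linear(1, 1)  # placeholder for the full ODDN network
# Adam exactly as quoted: lr = 0.0003, "default values for all remaining parameters".
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
```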
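The report quotes the existence of Algorithm 1 but not its body. As a hedged illustration only, the sketch below shows the kind of IODINE-style iterative refinement loop the hyperparameters above imply (R = 5 refinements per frame, K = 7 slots, reparameterised latents), with a relation step between frames in the spirit of the paper's claim that the relation module enables future prediction. `decoder`, `refine`, and `relation` are hypothetical callables, not the authors' Algorithm 1.

```python
import torch

def refinement_rollout(video, lmbda0, decoder, refine, relation, R=5):
    """Hedged sketch: iterative posterior refinement per frame, then a
    relation step between frames. Not the authors' Algorithm 1.

    video:  iterable of (B, 3, H, W) frames (training uses length-8 clips)
    lmbda0: (B, K, 2*D) initial posterior parameters, requires_grad=True
    """
    lmbda, losses = lmbda0, []
    for frame in video:
        for _ in range(R):                                        # R = 5 refinements
            mu, logvar = lmbda.chunk(2, dim=-1)
            z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterise
            rgb, mask_logits = decoder(z)                         # per-slot RGB + masks
            loss = mixture_nll(frame, rgb, mask_logits)           # from the sketch above
            grad = torch.autograd.grad(loss, lmbda, create_graph=True)[0]
            lmbda = lmbda + refine(lmbda, grad)                   # amortised update
            losses.append(loss)
        lmbda = relation(lmbda)  # hypothetical relation step: carry dynamics forward
    return lmbda, losses
```

The gradient-fed `refine` update is the signature move of IODINE-style amortised inference; where exactly ODDN distils object dynamics into this loop is specified by the paper's Algorithm 1, which this sketch does not reproduce.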