Unsupervised Learning for Physical Interaction through Video Prediction

Authors: Chelsea Finn, Ian Goodfellow, Sergey Levine

NeurIPS 2016 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments show that our proposed method produces more accurate video predictions both quantitatively and qualitatively, when compared to prior methods.
Researcher Affiliation | Collaboration | Chelsea Finn, UC Berkeley, cbfinn@eecs.berkeley.edu; Ian Goodfellow, OpenAI, ian@openai.com; Sergey Levine, Google Brain and UC Berkeley, slevine@google.com. Work was done while the author was at Google Brain.
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | The dataset, video results, and code are all available online: sites.google.com/site/robotprediction.
Open Datasets | Yes | We collected a new dataset using 10 robotic arms, shown in Figure 2, pushing hundreds of objects in bins, amounting to 57,000 interaction sequences with 1.5 million video frames. The dataset is publicly available (footnote 2: see http://sites.google.com/site/robotprediction). Further details on the data collection procedure are provided in Appendix A. We also evaluate our model on predicting future video without actions. We chose the Human3.6M dataset, which consists of human actors performing various actions in a room.
Dataset Splits | Yes | We held out 5% of the training set for validation. (A minimal split sketch follows the table.)
Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments.
Software Dependencies | No | We trained all models using the TensorFlow library [1], optimizing to convergence using ADAM [13] with the suggested hyperparameters. The paper mentions TensorFlow but does not specify a version number. (See the optimizer sketch after the table.)
Experiment Setup | Yes | We trained all models using the TensorFlow library [1], optimizing to convergence using ADAM [13] with the suggested hyperparameters. We trained for 8 future time steps for all recurrent models and test for up to 18 time steps. We trained the networks using an l2 reconstruction loss. We trained all models on 5 of the human subjects, held out one subject for validation, and held out a different subject for the evaluations presented here. We subsampled the video down to 10 fps such that there was noticeable motion in the videos within reasonable time frames. Since the model is no longer conditioned on actions, we fed in 10 video frames and trained the network to produce the next 10 frames, corresponding to 1 second each. Our evaluation measures performance up to 20 timesteps into the future. (A training-step sketch follows the table.)
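
Dataset Splits, sketch referenced in the table above. The paper only states that 5% of the training set was held out for validation; the shuffling, seed, and helper function below are illustrative assumptions, not the authors' procedure.

```python
# Illustrative only: hold out 5% of the interaction sequences for validation.
# The sequence ids, random seed, and helper name are assumptions, not from the paper.
import random

def split_train_val(sequence_ids, val_fraction=0.05, seed=0):
    """Shuffle sequence ids and reserve `val_fraction` of them for validation."""
    ids = list(sequence_ids)
    random.Random(seed).shuffle(ids)
    n_val = int(len(ids) * val_fraction)
    return ids[n_val:], ids[:n_val]  # (train_ids, val_ids)

# With the ~57,000 robot interaction sequences reported in the paper:
train_ids, val_ids = split_train_val(range(57000))
print(len(train_ids), len(val_ids))  # 54150 2850
```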
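Software Dependencies, sketch referenced in the table above. The paper reports training with TensorFlow and ADAM at the hyperparameters suggested by Kingma and Ba [13], but gives no versions; the tf.keras API shown here is an assumption for illustration, since the original code predates it.

```python
# ADAM with the suggested hyperparameters (Kingma & Ba defaults).
# The tf.keras optimizer API is assumed; the paper does not name a TF version.
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(
    learning_rate=0.001,  # suggested default step size
    beta_1=0.9,
    beta_2=0.999,
    epsilon=1e-8,
)
```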
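Experiment Setup, sketch referenced in the table above. This is a minimal view of the l2 reconstruction objective over predicted future frames; the model interface, batch layout, and training-step helper are assumptions, while the l2 loss and the idea of conditioning on context frames (plus actions for the robot data) come from the paper.

```python
# Minimal training-step sketch, not the authors' implementation.
# The robot-data models are trained on 8 future frames and tested on up to 18;
# the Human3.6M models are fed 10 frames and predict the next 10.
import tensorflow as tf

def l2_reconstruction_loss(predicted_frames, target_frames):
    """Mean squared error over all predicted time steps, pixels, and channels."""
    return tf.reduce_mean(tf.square(predicted_frames - target_frames))

def train_step(model, optimizer, context_frames, future_frames):
    """One update: predict the future frames from the context and regress to them."""
    with tf.GradientTape() as tape:
        # Hypothetical model call: returns a tensor shaped like `future_frames`.
        predictions = model(context_frames)
        loss = l2_reconstruction_loss(predictions, future_frames)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```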