Learning to Generate Long-term Future via Hierarchical Prediction

Authors: Ruben Villegas, Jimei Yang, Yuliang Zou, Sungryull Sohn, Xunyu Lin, Honglak Lee

ICML 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In experiments, our model is evaluated on the Human3.6M and Penn Action datasets on the task of long-term pixel-level video prediction of humans performing actions and demonstrate significantly better results than the state-of-the-art.
Researcher Affiliation | Collaboration | 1 Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI, USA. 2 Adobe Research, San Jose, CA. 3 Beihang University, Beijing, China. 4 Google Brain, Mountain View, CA.
Pseudocode | Yes | Algorithm 1: Video Prediction Procedure
Open Source Code | No | For video illustration of our method, please refer to the project website: https://sites.google.com/a/umich.edu/rubenevillegas/hierch_vid. The provided link is for video illustrations, not explicitly for the source code of the methodology.
Open Datasets | Yes | We present experiments on pixel-level video prediction of human actions on the Penn Action (Weiyu Zhang & Derpanis, 2013) and Human 3.6M datasets (Ionescu et al., 2014).
Dataset Splits | Yes | To train our image generator, we use the standard train split provided in the dataset. To train our pose predictor, we sub-sample the actions in the standard train-test split due to very noisy joint ground-truth.
Hardware Specification | Yes | We thank NVIDIA for donating K40c and TITAN X GPUs.
Software Dependencies | No | The paper mentions models/architectures such as AlexNet and VGG16, but does not provide specific version numbers for any software dependencies, libraries, or programming languages used.
Experiment Setup | Yes | The sequence prediction LSTM is made of a single layer encoder-decoder LSTM with tied parameters, 1024 hidden units, and tanh output activation. The image and pose encoders are built with the same architecture as VGG16 (Simonyan & Zisserman, 2015) up to the pooling layer ... Our pose predictor is trained to observe 10 inputs and predict 32 steps, and tested on predicting up to 64 steps (some videos' ground truth ends before 64 steps). Our image generator is trained to make single random jumps within 30 steps into the future.
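For concreteness, the following is a minimal sketch of the pose-sequence model described in the Experiment Setup row: a single-layer encoder-decoder LSTM with tied (shared) parameters, 1024 hidden units, and a tanh output activation, observing 10 pose inputs and predicting 32 future steps. It is written in PyTorch purely for illustration; the class name, the pose dimensionality, and the choice to feed predictions back as decoder inputs are assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class PoseSequenceLSTM(nn.Module):
    """Illustrative encoder-decoder LSTM for future pose prediction.

    Assumptions: a single LSTMCell shared between the encoding and
    decoding phases (tied parameters), 1024 hidden units, tanh output
    activation, and a 26-dimensional pose vector (e.g. 13 joints x 2D
    coordinates); none of these names come from the paper's code.
    """

    def __init__(self, pose_dim=26, hidden_size=1024):
        super().__init__()
        # One cell used for both encoding and decoding -> tied parameters.
        self.cell = nn.LSTMCell(pose_dim, hidden_size)
        self.out = nn.Linear(hidden_size, pose_dim)

    def forward(self, observed, future_steps=32):
        # observed: (batch, T_obs, pose_dim), e.g. T_obs = 10 observed poses.
        batch = observed.size(0)
        h = observed.new_zeros(batch, self.cell.hidden_size)
        c = observed.new_zeros(batch, self.cell.hidden_size)

        # Encoding phase: consume the observed pose sequence.
        for t in range(observed.size(1)):
            h, c = self.cell(observed[:, t], (h, c))

        # Decoding phase: roll the same cell forward, feeding predictions back in.
        prev = observed[:, -1]
        preds = []
        for _ in range(future_steps):
            h, c = self.cell(prev, (h, c))
            prev = torch.tanh(self.out(h))  # tanh output activation
            preds.append(prev)
        return torch.stack(preds, dim=1)  # (batch, future_steps, pose_dim)


# Usage: observe 10 pose frames, predict 32 future poses.
model = PoseSequenceLSTM()
poses = torch.randn(4, 10, 26)
future = model(poses, future_steps=32)
```

At test time the same model could simply be unrolled for up to 64 decoding steps, matching the evaluation horizon quoted in the Experiment Setup row.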