Learning to Generate Long-term Future via Hierarchical Prediction
Authors: Ruben Villegas, Jimei Yang, Yuliang Zou, Sungryull Sohn, Xunyu Lin, Honglak Lee
ICML 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In experiments, our model is evaluated on the Human3.6M and Penn Action datasets on the task of long-term pixel-level video prediction of humans performing actions, and demonstrates significantly better results than the state-of-the-art. |
| Researcher Affiliation | Collaboration | 1Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI, USA. 2Adobe Research, San Jose, CA. 3Beihang University, Beijing, China. 4Google Brain, Mountain View, CA. |
| Pseudocode | Yes | Algorithm 1 Video Prediction Procedure |
| Open Source Code | No | For video illustration of our method, please refer to the project website: https://sites.google.com/a/umich.edu/rubenevillegas/hierch_vid. The provided link points to video illustrations, not to source code for the method. |
| Open Datasets | Yes | We present experiments on pixel-level video prediction of human actions on the Penn Action (Weiyu Zhang & Derpanis, 2013) and Human 3.6M datasets (Ionescu et al., 2014). |
| Dataset Splits | Yes | To train our image generator, we use the standard train split provided in the dataset. To train our pose predictor, we sub-sample the actions in the standard train-test split due to very noisy joint ground-truth. |
| Hardware Specification | Yes | We thank NVIDIA for donating K40c and TITAN X GPUs. |
| Software Dependencies | No | The paper mentions models/architectures such as AlexNet and VGG16, but does not provide specific version numbers for any software dependencies, libraries, or programming languages used. |
| Experiment Setup | Yes | The sequence prediction LSTM is made of a single-layer encoder-decoder LSTM with tied parameters, 1024 hidden units, and tanh output activation. The image and pose encoders are built with the same architecture as VGG16 (Simonyan & Zisserman, 2015) up to the pooling layer...Our pose predictor is trained to observe 10 inputs and predict 32 steps, and tested on predicting up to 64 steps (some videos' ground truth ends before 64 steps). Our image generator is trained to make single random jumps within 30 steps into the future. |
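
The Experiment Setup row pins down the pose-sequence predictor's key numbers: a single-layer encoder-decoder LSTM with tied parameters, 1024 hidden units, a tanh output activation, 10 observed steps, and 32 predicted steps. The sketch below is a minimal illustration of that configuration only, assuming PyTorch; the `PosePredictorLSTM` class name and the 26-dimensional pose vector (13 joints in 2D) are illustrative assumptions, not details from the paper or any released code.

```python
# Minimal sketch of the pose-sequence predictor described in the table above.
# Assumptions (not from the paper): PyTorch, pose_dim=26 (13 joints x 2D),
# and the autoregressive decoding loop shown here.
import torch
import torch.nn as nn


class PosePredictorLSTM(nn.Module):
    def __init__(self, pose_dim=26, hidden_size=1024):
        super().__init__()
        # One LSTM cell shared (tied parameters) between encoding and decoding.
        self.cell = nn.LSTMCell(pose_dim, hidden_size)
        # Linear read-out followed by tanh, matching the stated output activation.
        self.readout = nn.Linear(hidden_size, pose_dim)

    def forward(self, observed, future_steps=32):
        # observed: (batch, T_obs, pose_dim), e.g. T_obs = 10 observed frames.
        batch = observed.size(0)
        h = observed.new_zeros(batch, self.cell.hidden_size)
        c = observed.new_zeros(batch, self.cell.hidden_size)

        # Encode the observed pose sequence.
        for t in range(observed.size(1)):
            h, c = self.cell(observed[:, t], (h, c))

        # Decode future poses autoregressively with the same (tied) cell.
        preds, prev = [], observed[:, -1]
        for _ in range(future_steps):
            h, c = self.cell(prev, (h, c))
            prev = torch.tanh(self.readout(h))
            preds.append(prev)
        return torch.stack(preds, dim=1)  # (batch, future_steps, pose_dim)


# Example: observe 10 pose frames, predict 32 future frames.
model = PosePredictorLSTM()
past = torch.randn(4, 10, 26)
future = model(past, future_steps=32)
print(future.shape)  # torch.Size([4, 32, 26])
```

Passing `future_steps=64` rolls the same decoder out to the 64-step horizon the table says is used at test time.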