Learning to Generate Long-term Future via Hierarchical Prediction

Authors: Ruben Villegas, Jimei Yang, Yuliang Zou, Sungryull Sohn, Xunyu Lin, Honglak Lee

ICML 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In experiments, our model is evaluated on the Human3.6M and Penn Action datasets on the task of long-term pixel-level video prediction of humans performing actions and demonstrate significantly better results than the state-of-the-art.
Researcher Affiliation | Collaboration | 1 Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI, USA. 2 Adobe Research, San Jose, CA. 3 Beihang University, Beijing, China. 4 Google Brain, Mountain View, CA.
Pseudocode | Yes | Algorithm 1: Video Prediction Procedure
Open Source Code | No | For video illustration of our method, please refer to the project website: https://sites.google.com/a/umich.edu/rubenevillegas/hierch_vid. The provided link is for video illustrations, not explicitly for the source code of the methodology.
Open Datasets | Yes | We present experiments on pixel-level video prediction of human actions on the Penn Action (Weiyu Zhang & Derpanis, 2013) and Human 3.6M datasets (Ionescu et al., 2014).
Dataset Splits | Yes | To train our image generator, we use the standard train split provided in the dataset. To train our pose predictor, we sub-sample the actions in the standard train-test split due to very noisy joint ground-truth.
Hardware Specification | Yes | We thank NVIDIA for donating K40c and TITAN X GPUs.
Software Dependencies | No | The paper mentions models/architectures such as AlexNet and VGG16, but does not provide specific version numbers for any software dependencies, libraries, or programming languages used.
Experiment Setup | Yes | The sequence prediction LSTM is made of a single layer encoder-decoder LSTM with tied parameters, 1024 hidden units, and tanh output activation. The image and pose encoders are built with the same architecture as VGG16 (Simonyan & Zisserman, 2015) up to the pooling layer ... Our pose predictor is trained to observe 10 inputs and predict 32 steps, and tested on predicting up to 64 steps (some videos' ground truth ends before 64 steps). Our image generator is trained to make single random jumps within 30 steps into the future.
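For concreteness, the following is a minimal sketch of the pose-sequence model described in the Experiment Setup row: a single-layer encoder-decoder LSTM with tied (shared) parameters, 1024 hidden units, and a tanh output activation, observing 10 pose inputs and predicting 32 future steps. It is written in PyTorch purely for illustration; the class name, the pose dimensionality, and the choice to feed predictions back as decoder inputs are assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class PoseSequenceLSTM(nn.Module):
    """Illustrative encoder-decoder LSTM for future pose prediction.

    Assumptions: a single LSTMCell shared between the encoding and
    decoding phases (tied parameters), 1024 hidden units, tanh output
    activation, and a 26-dimensional pose vector (e.g. 13 joints x 2D
    coordinates); none of these names come from the paper's code.
    """

    def __init__(self, pose_dim=26, hidden_size=1024):
        super().__init__()
        # One cell used for both encoding and decoding -> tied parameters.
        self.cell = nn.LSTMCell(pose_dim, hidden_size)
        self.out = nn.Linear(hidden_size, pose_dim)

    def forward(self, observed, future_steps=32):
        # observed: (batch, T_obs, pose_dim), e.g. T_obs = 10 observed poses.
        batch = observed.size(0)
        h = observed.new_zeros(batch, self.cell.hidden_size)
        c = observed.new_zeros(batch, self.cell.hidden_size)

        # Encoding phase: consume the observed pose sequence.
        for t in range(observed.size(1)):
            h, c = self.cell(observed[:, t], (h, c))

        # Decoding phase: roll the same cell forward, feeding predictions back in.
        prev = observed[:, -1]
        preds = []
        for _ in range(future_steps):
            h, c = self.cell(prev, (h, c))
            prev = torch.tanh(self.out(h))  # tanh output activation
            preds.append(prev)
        return torch.stack(preds, dim=1)  # (batch, future_steps, pose_dim)


# Usage: observe 10 pose frames, predict 32 future poses.
model = PoseSequenceLSTM()
poses = torch.randn(4, 10, 26)
future = model(poses, future_steps=32)
```

At test time the same model could simply be unrolled for up to 64 decoding steps, matching the evaluation horizon quoted in the Experiment Setup row.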