Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Learning to Generate Long-term Future via Hierarchical Prediction

Authors: Ruben Villegas, Jimei Yang, Yuliang Zou, Sungryull Sohn, Xunyu Lin, Honglak Lee

ICML 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In experiments, our model is evaluated on the Human3.6M and Penn Action datasets on the task of long-term pixel-level video prediction of humans performing actions and demonstrate significantly better results than the state-of-the-art.
Researcher Affiliation | Collaboration | 1Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI, USA. 2Adobe Research, San Jose, CA. 3Beihang University, Beijing, China. 4Google Brain, Mountain View, CA.
Pseudocode | Yes | Algorithm 1 Video Prediction Procedure
Open Source Code | No | For video illustration of our method, please refer to the project website: https://sites.google.com/a/umich.edu/rubenevillegas/hierch_vid. The provided link is for video illustrations, not explicitly for the source code of the methodology.
Open Datasets | Yes | We present experiments on pixel-level video prediction of human actions on the Penn Action (Weiyu Zhang & Derpanis, 2013) and Human 3.6M datasets (Ionescu et al., 2014).
Dataset Splits | Yes | To train our image generator, we use the standard train split provided in the dataset. To train our pose predictor, we sub-sample the actions in the standard train-test split due to very noisy joint ground-truth.
Hardware Specification | Yes | We thank NVIDIA for donating K40c and TITAN X GPUs.
Software Dependencies | No | The paper mentions models/architectures such as AlexNet and VGG16, but does not provide specific version numbers for any software dependencies, libraries, or programming languages used.
Experiment Setup | Yes | The sequence prediction LSTM is made of a single layer encoder-decoder LSTM with tied parameters, 1024 hidden units, and tanh output activation. The image and pose encoders are built with the same architecture as VGG16 (Simonyan & Zisserman, 2015) up to the pooling layer... Our pose predictor is trained to observe 10 inputs and predict 32 steps, and tested on predicting up to 64 steps (some videos' ground truth ends before 64 steps). Our image generator is trained to make single random jumps within 30 steps into the future.
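The pose-predictor description above (a single-layer encoder-decoder LSTM with tied parameters, 1024 hidden units, and tanh output, observing 10 inputs and predicting 32 steps) can be sketched roughly as follows. This is a minimal illustrative reconstruction, not the authors' code: the pose dimensionality (here 26, i.e. 13 joints × 2 coordinates), the weight initialization, and the autoregressive decoding scheme are all assumptions.

```python
import numpy as np

H, D = 1024, 26  # hidden units (per the paper); D = assumed 2-D pose dimension

rng = np.random.default_rng(0)
# One shared (tied) parameter set used by both the encoder and the decoder.
W = rng.standard_normal((4 * H, D + H)) * 0.01  # gate weights for [x; h]
b = np.zeros(4 * H)                             # gate biases
W_out = rng.standard_normal((D, H)) * 0.01      # projection to pose space

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c):
    """One step of a standard LSTM cell with the shared parameters."""
    z = W @ np.concatenate([x, h]) + b
    i, f, o, g = np.split(z, 4)
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)
    return h, c

def predict_poses(observed, n_future=32):
    """Encode the observed poses, then decode n_future poses autoregressively."""
    h = c = np.zeros(H)
    for x in observed:            # encoder: consume the 10 observed poses
        h, c = lstm_step(x, h, c)
    preds, x = [], observed[-1]
    for _ in range(n_future):     # decoder: same tied parameters
        h, c = lstm_step(x, h, c)
        x = np.tanh(W_out @ h)    # tanh output activation (per the paper)
        preds.append(x)
    return np.stack(preds)

observed = rng.standard_normal((10, D)) * 0.1   # 10 observed pose vectors
future = predict_poses(observed, n_future=32)
print(future.shape)  # (32, 26)
```

The tanh output keeps predicted joint coordinates in [-1, 1], consistent with normalized pose coordinates; at test time the same decoder loop would simply be run for up to 64 steps instead of 32.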