Revisiting Hierarchical Approach for Persistent Long-Term Video Prediction
Authors: Wonkwang Lee, Whie Jung, Han Zhang, Ting Chen, Jing Yu Koh, Thomas Huang, Hyungsuk Yoon, Honglak Lee, Seunghoon Hong
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our method on three challenging datasets involving car driving and human dancing, and demonstrate that it can generate complicated scene structures and motions over a very long time horizon (i.e., thousands of frames), setting a new standard of video prediction with orders of magnitude longer prediction time than existing approaches. |
| Researcher Affiliation | Collaboration | Wonkwang Lee1, Whie Jung1, Han Zhang2, Ting Chen2, Jing Yu Koh2, Thomas Huang3, Hyungsuk Yoon4, Honglak Lee3,5, Seunghoon Hong1 1KAIST, 2Google Research, 3University of Michigan, 4MOLOCO, 5LG AI Research |
| Pseudocode | No | The paper provides architectural details in tables (Table B, C, D, E) describing layers and operations. However, these are presented as network specifications, not as a clearly labeled 'Pseudocode' or 'Algorithm' block with structured steps. |
| Open Source Code | Yes | Full videos and codes are available at https://1konny.github.io/HVP/. |
| Open Datasets | Yes | To evaluate if our model can learn complex and highly structured motion, we used videos of human dancing (Wang et al., 2018). We construct this dataset by crawling a set of videos from the web containing a single person covering various dance moves. We collect approximately 240 videos in total for training. ... we adopt the KITTI dataset (Geiger et al., 2013) ... To evaluate the quality of forecasting dense labels by the structure generator, we employ the Cityscapes dataset (Cordts et al., 2016), a widely-used benchmark in future segmentation tasks. |
| Dataset Splits | Yes | On the KITTI dataset, we conducted the evaluation over the 133 unique sequences present in the validation set. On the Human Dancing dataset, we conducted evaluation over 97 videos used for validation. ... The dataset consists of 2,975 training, 1,525 testing, and 500 validation sequences, where each sequence is 30 frames long and has ground-truth segmentation labels only in the 20th frame (see the indexing sketch after the table). |
| Hardware Specification | Yes | Table F: Time required to train the model. ... # GPU (V100 16GB) |
| Software Dependencies | No | The paper mentions 'ADAM optimizer (Kingma & Ba, 2015)' but does not specify version numbers for any programming languages (e.g., Python), deep learning frameworks (e.g., PyTorch, TensorFlow), or other software libraries that would be required to reproduce the experiments. |
| Experiment Setup | Yes | For hyperparameters, we use C = 5; we set β = 0.0005; we use the ADAM optimizer (Kingma & Ba, 2015) with a learning rate of 0.0001 and (β1, β2) = (0.9, 0.999). ... All models are trained to predict 40 future frames given 5 context frames. ... We use τ = 5 for KITTI and Human Dancing and τ = 4 for Cityscapes. ... We use τ = 3 for all experiments (see the training-loop sketch below). |
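
As an illustrative aid, the optimizer and prediction settings quoted in the Experiment Setup row can be wired into a minimal training-loop sketch. Only the numerical values (learning rate 0.0001, Adam betas (0.9, 0.999), KL weight β = 0.0005, 5 context frames, 40 predicted frames) come from the paper; the model, data, and loss wiring below are hypothetical stand-ins, not the authors' implementation.

```python
import torch
import torch.nn as nn

# Hyperparameter values quoted in the table above; the model and data below
# are dummy stand-ins for illustration, not the authors' architecture.
NUM_CONTEXT, NUM_FUTURE = 5, 40        # 5 context frames -> 40 predicted frames
KL_WEIGHT = 0.0005                     # beta weighting the KL term
LR, BETAS = 1e-4, (0.9, 0.999)         # ADAM settings from the paper

class DummyPredictor(nn.Module):
    """Hypothetical placeholder for the paper's hierarchical predictor."""
    def __init__(self, frame_dim=64):
        super().__init__()
        self.rnn = nn.GRU(frame_dim, frame_dim, batch_first=True)
        self.head = nn.Linear(frame_dim, frame_dim)

    def forward(self, context, num_future):
        _, h = self.rnn(context)           # encode the context frames
        frames, x = [], context[:, -1:]    # autoregressive rollout from last frame
        for _ in range(num_future):
            out, h = self.rnn(x, h)
            x = self.head(out)
            frames.append(x)
        return torch.cat(frames, dim=1)

model = DummyPredictor()
optimizer = torch.optim.Adam(model.parameters(), lr=LR, betas=BETAS)

# One synthetic batch: flattened 64-d "frames" stand in for real video frames.
context = torch.randn(2, NUM_CONTEXT, 64)
target = torch.randn(2, NUM_FUTURE, 64)

pred = model(context, NUM_FUTURE)
kl = torch.tensor(0.0)                 # stand-in for the model's KL divergence
loss = nn.functional.mse_loss(pred, target) + KL_WEIGHT * kl
loss.backward()
optimizer.step()
```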
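
The Cityscapes protocol quoted in the Dataset Splits row (30-frame sequences, ground truth only at the 20th frame) implies that predictions must be aligned with that single labeled frame at evaluation time. Below is a small bookkeeping sketch of that alignment, assuming zero-based frame indexing; the paper does not state its indexing convention.

```python
# Bookkeeping sketch for the quoted Cityscapes protocol: 30-frame sequences
# with a ground-truth segmentation only at the 20th frame. The zero-based
# indexing convention here is an assumption, not stated in the paper.
SEQ_LEN = 30
LABELED_FRAME = 20                   # "20th frame" in 1-based counting
LABELED_INDEX = LABELED_FRAME - 1    # index 19 in zero-based arrays

def rollout_length(num_context: int) -> int:
    """Steps to predict so the final output aligns with the labeled frame."""
    assert 0 < num_context <= LABELED_INDEX, "context must end before the label"
    return LABELED_INDEX - num_context + 1

# e.g. conditioning on 4 frames requires a 16-step rollout to reach frame 20.
print(rollout_length(4))  # -> 16
```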