Revisiting Hierarchical Approach for Persistent Long-Term Video Prediction
Authors: Wonkwang Lee, Whie Jung, Han Zhang, Ting Chen, Jing Yu Koh, Thomas Huang, Hyungsuk Yoon, Honglak Lee, Seunghoon Hong
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our method on three challenging datasets involving car driving and human dancing, and demonstrate that it can generate complicated scene structures and motions over a very long time horizon (i.e., thousands of frames), setting a new standard of video prediction with orders of magnitude longer prediction time than existing approaches. |
| Researcher Affiliation | Collaboration | Wonkwang Lee1, Whie Jung1, Han Zhang2, Ting Chen2, Jing Yu Koh2, Thomas Huang3, Hyungsuk Yoon4, Honglak Lee3,5, Seunghoon Hong1 1KAIST, 2Google Research, 3University of Michigan, 4MOLOCO, 5LG AI Research |
| Pseudocode | No | The paper provides architectural details in tables (Table B, C, D, E) describing layers and operations. However, these are presented as network specifications, not as a clearly labeled 'Pseudocode' or 'Algorithm' block with structured steps. |
| Open Source Code | Yes | Full videos and codes are available at https://1konny.github.io/HVP/. |
| Open Datasets | Yes | To evaluate if our model can learn complex and highly structured motion, we used videos of human dancing (Wang et al., 2018). We construct this dataset by crawling a set of videos from the web containing a single person covering various dance moves. We collect approximately 240 videos in total for training. ... we adopt the KITTI dataset (Geiger et al., 2013) ... To evaluate the quality of forecasting dense labels by the structure generator, we employ the Cityscapes dataset (Cordts et al., 2016), a widely-used benchmark in future segmentation tasks. |
| Dataset Splits | Yes | On the KITTI dataset, we conducted the evaluation over the 133 unique sequences present in the validation set. On the Human Dancing dataset, we conducted evaluation over 97 videos used for validation. ... The dataset consists of 2,975 training, 1,525 testing, and 500 validation sequences, where each sequence is 30 frames long and has ground-truth segmentation labels only in the 20th frame (see the indexing sketch after the table). |
| Hardware Specification | Yes | Table F: Time required to train the model. ... # GPU (V100 16GB) |
| Software Dependencies | No | The paper mentions 'ADAM optimizer (Kingma & Ba, 2015)' but does not specify version numbers for any programming languages (e.g., Python), deep learning frameworks (e.g., PyTorch, TensorFlow), or other software libraries that would be required to reproduce the experiments. |
| Experiment Setup | Yes | For hyperparameters, we use C = 5; we set β = 0.0005; we use the ADAM optimizer (Kingma & Ba, 2015) with a learning rate of 0.0001 and (β1, β2) = (0.9, 0.999). ... All models are trained to predict 40 future frames given 5 context frames. ... We use τ = 5 for KITTI and Human Dancing and τ = 4 for Cityscapes. ... We use τ = 3 for all experiments (see the training-loop sketch below). |
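
As an illustrative aid, the optimizer and prediction settings quoted in the Experiment Setup row can be wired into a minimal training-loop sketch. Only the numerical values (learning rate 0.0001, Adam betas (0.9, 0.999), KL weight β = 0.0005, 5 context frames, 40 predicted frames) come from the paper; the model, data, and loss wiring below are hypothetical stand-ins, not the authors' implementation.

```python
import torch
import torch.nn as nn

# Hyperparameter values quoted in the table above; the model and data below
# are dummy stand-ins for illustration, not the authors' architecture.
NUM_CONTEXT, NUM_FUTURE = 5, 40        # 5 context frames -> 40 predicted frames
KL_WEIGHT = 0.0005                     # beta weighting the KL term
LR, BETAS = 1e-4, (0.9, 0.999)         # ADAM settings from the paper

class DummyPredictor(nn.Module):
    """Hypothetical placeholder for the paper's hierarchical predictor."""
    def __init__(self, frame_dim=64):
        super().__init__()
        self.rnn = nn.GRU(frame_dim, frame_dim, batch_first=True)
        self.head = nn.Linear(frame_dim, frame_dim)

    def forward(self, context, num_future):
        _, h = self.rnn(context)           # encode the context frames
        frames, x = [], context[:, -1:]    # autoregressive rollout from last frame
        for _ in range(num_future):
            out, h = self.rnn(x, h)
            x = self.head(out)
            frames.append(x)
        return torch.cat(frames, dim=1)

model = DummyPredictor()
optimizer = torch.optim.Adam(model.parameters(), lr=LR, betas=BETAS)

# One synthetic batch: flattened 64-d "frames" stand in for real video frames.
context = torch.randn(2, NUM_CONTEXT, 64)
target = torch.randn(2, NUM_FUTURE, 64)

pred = model(context, NUM_FUTURE)
kl = torch.tensor(0.0)                 # stand-in for the model's KL divergence
loss = nn.functional.mse_loss(pred, target) + KL_WEIGHT * kl
loss.backward()
optimizer.step()
```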
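
The Cityscapes protocol quoted in the Dataset Splits row (30-frame sequences, ground truth only at the 20th frame) implies that predictions must be aligned with that single labeled frame at evaluation time. Below is a small bookkeeping sketch of that alignment, assuming zero-based frame indexing; the paper does not state its indexing convention.

```python
# Bookkeeping sketch for the quoted Cityscapes protocol: 30-frame sequences
# with a ground-truth segmentation only at the 20th frame. The zero-based
# indexing convention here is an assumption, not stated in the paper.
SEQ_LEN = 30
LABELED_FRAME = 20                   # "20th frame" in 1-based counting
LABELED_INDEX = LABELED_FRAME - 1    # index 19 in zero-based arrays

def rollout_length(num_context: int) -> int:
    """Steps to predict so the final output aligns with the labeled frame."""
    assert 0 < num_context <= LABELED_INDEX, "context must end before the label"
    return LABELED_INDEX - num_context + 1

# e.g. conditioning on 4 frames requires a 16-step rollout to reach frame 20.
print(rollout_length(4))  # -> 16
```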