HAF-SVG: Hierarchical Stochastic Video Generation with Aligned Features

Authors: Zhihui Lin, Chun Yuan, Maomao Li

IJCAI 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on Moving MNIST, BAIR, and KTH datasets demonstrate that hierarchical structure is helpful for modeling more accurate future uncertainty, and the feature aligner is beneficial to generate realistic frames.
Researcher Affiliation | Academia | (1) Department of Computer Science and Technologies, Tsinghua University, Beijing, China; (2) Graduate School at Shenzhen, Tsinghua University, Shenzhen, China; (3) Peng Cheng Laboratory, Shenzhen, China
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any concrete access information for the source code (e.g., a repository link, an explicit statement of code release, or mention of code in supplementary materials).
Open Datasets | Yes | We perform experiments on synthetic sequences (Moving MNIST [Srivastava et al., 2015]), as well as real-world videos (KTH action [Schuldt et al., 2004] and BAIR robot [Ebert et al., 2017]).
Dataset Splits | No | The paper describes aspects of the training and testing protocol, such as 'Each training sequence consists of 15 consecutive frames, 5 for the input and 10 for the prediction' and 'For each sequence, 100 predictions are sampled and one with the best score with respect to the ground-truth', but it does not specify explicit train/validation/test splits (e.g., percentages or counts). (A sketch of this best-of-100 protocol appears after the table.)
Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types, or memory amounts) used for running its experiments.
Software Dependencies | No | The paper mentions using VGG16 and DCGAN architectures, and LSTM layers, but does not specify software dependencies with version numbers (e.g., Python, PyTorch, or TensorFlow versions).
Experiment Setup | Yes | We adopt the experiment setup in SVG [Denton and Fergus, 2018], where all frames are resized to 64 × 64. LSTM_θ is implemented as a two-layer LSTM with 256 cells in each layer, while LSTM_φj and LSTM_ψj are single-layer LSTMs with 256 cells. The output dimensionality of the LSTM networks is 128, and |h_t| = 128 for all three datasets. For KTH and BAIR, the encoder E adopts the VGG16 [Simonyan and Zisserman, 2015] architecture, and the frame decoder D is the mirrored version of the encoder. |µ_φj| = |µ_ψj| is set to 24 on KTH and 64 on BAIR. For Moving MNIST, we adopt the DCGAN discriminator architecture [Radford et al., 2016] as our E, the DCGAN generator architecture as D, and |µ_φj| = |µ_ψj| = 16. Besides, we use β = 1e-4 for Moving MNIST and BAIR, and β = 1e-6 for KTH.
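
The settings quoted in the Experiment Setup row can be collected into a single configuration object. The following is a minimal sketch, not code released with the paper: the class and field names (HAFSVGConfig, rnn_size, z_dim, and so on) are assumed for illustration, and only the numeric values come from the quoted setup and the 5-input/10-prediction protocol above.

from dataclasses import dataclass

@dataclass
class HAFSVGConfig:
    # Hypothetical container for the hyperparameters reported in the paper.
    image_size: int = 64        # all frames are resized to 64 x 64
    g_dim: int = 128            # LSTM output dimensionality, |h_t| = 128
    rnn_size: int = 256         # cells per LSTM layer
    predictor_layers: int = 2   # LSTM_theta: two-layer LSTM
    posterior_layers: int = 1   # LSTM_phi_j: single-layer LSTM
    prior_layers: int = 1       # LSTM_psi_j: single-layer LSTM
    z_dim: int = 24             # |mu_phi_j| = |mu_psi_j|, set per dataset
    beta: float = 1e-6          # beta as quoted; in SVG-style models this weights the KL term
    encoder: str = "vgg16"      # encoder architecture; decoder mirrors the encoder
    n_past: int = 5             # conditioning (input) frames
    n_future: int = 10          # predicted frames

# Per-dataset values reported in the paper:
kth_cfg   = HAFSVGConfig(z_dim=24, beta=1e-6, encoder="vgg16")
bair_cfg  = HAFSVGConfig(z_dim=64, beta=1e-4, encoder="vgg16")
mnist_cfg = HAFSVGConfig(z_dim=16, beta=1e-4, encoder="dcgan")  # DCGAN-style encoder/decoder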
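
The best-of-100 evaluation mentioned under Dataset Splits can be sketched as follows. This is an illustrative reconstruction under assumptions: model.sample_future is a hypothetical interface for drawing one stochastic prediction, and per-frame SSIM over grayscale frames in [0, 1] stands in for whatever per-sequence score is actually reported.

import numpy as np
from skimage.metrics import structural_similarity as ssim

def best_of_k_score(model, context_frames, gt_future, k=100):
    # For each test sequence, sample k futures and keep the one that
    # scores best against the ground truth, as described in the quoted protocol.
    best = -np.inf
    for _ in range(k):
        pred = model.sample_future(context_frames, steps=len(gt_future))
        # Mean per-frame similarity over the 10-frame prediction horizon.
        score = float(np.mean([ssim(p, g, data_range=1.0)
                               for p, g in zip(pred, gt_future)]))
        best = max(best, score)
    return best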