Eidetic 3D LSTM: A Model for Video Prediction and Beyond

Authors: Yunbo Wang, Lu Jiang, Ming-Hsuan Yang, Li-Jia Li, Mingsheng Long, Li Fei-Fei

ICLR 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We first evaluate the E3D-LSTM network on widely used future video prediction datasets and achieve state-of-the-art performance. Then we show that the E3D-LSTM network also performs well on early activity recognition, inferring what is happening or what will happen after observing only a limited number of video frames. We present ablation studies to verify the effectiveness of all modules in the proposed E3D-LSTM model.
Researcher Affiliation | Collaboration | Yunbo Wang (1), Lu Jiang (2), Ming-Hsuan Yang (2, 3), Li-Jia Li (4), Mingsheng Long (1), Li Fei-Fei (4); 1: Tsinghua University, 2: Google AI, 3: University of California, Merced, 4: Stanford University
Pseudocode | No | The paper presents equations and diagrams illustrating the model architecture and memory transitions, but it does not include any structured pseudocode or algorithm blocks.
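Although the paper conveys its memory transition only through equations, the core eidetic recall step can be summarized in a few lines. Below is a minimal NumPy sketch, assuming the attention form softmax(R_t · Cᵀ) · C over the τ most recent cell states described in the paper, with spatiotemporal tensors flattened to (positions, channels); the function name and shapes are our illustration, not the authors' code.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def eidetic_recall(recall_gate, past_cells):
    """Attend over the stacked cell states of the previous tau steps:
    RECALL(R_t, C) = softmax(R_t @ C.T) @ C, with all tensors flattened
    to (positions, channels)."""
    attn = softmax(recall_gate @ past_cells.T, axis=-1)  # (n, tau * n)
    return attn @ past_cells                             # (n, d)

# toy shapes: n flattened spatiotemporal positions, d channels, tau remembered steps
n, d, tau = 8, 64, 4
R_t = np.random.randn(n, d)        # query built from the current input and hidden state
C_past = np.random.randn(tau * n, d)
print(eidetic_recall(R_t, C_past).shape)  # (8, 64)
```

The full E3D-LSTM cell additionally wraps this recall in standard input/forget-style gating and layer normalization, which the sketch omits.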
Open Source Code | Yes | The source code and trained models will be made available to the public.
Open Datasets | Yes | The moving MNIST dataset is constructed by randomly sampling two digits from the original MNIST dataset... The KTH action dataset (Schuldt et al., 2004) contains 25 individuals... The TaxiBJ dataset is collected from a chaotic real-world environment using the GPS monitors of taxicabs in Beijing... The something-something dataset (Goyal et al., 2017) is a recent benchmark for activity/action recognition (https://20bn.com/datasets/something-something).
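Since the Moving MNIST construction is quoted only in prose, here is an illustrative Python reconstruction of the sequence generator. The canvas size, velocity range, and pixel-wise max for overlapping digits are our assumptions, and loading the 28x28 MNIST digit crops is left out.

```python
import numpy as np

def make_moving_mnist_sequence(digits, seq_len=20, canvas=64, rng=None):
    """Bounce each 28x28 digit around a canvas-sized frame, reflecting its
    velocity at the borders; returns an array of shape (seq_len, canvas, canvas)."""
    rng = rng or np.random.default_rng()
    h = digits[0].shape[0]
    pos = rng.uniform(0, canvas - h, size=(len(digits), 2))  # top-left corners
    vel = rng.uniform(-3, 3, size=(len(digits), 2))          # pixels per frame
    frames = np.zeros((seq_len, canvas, canvas), dtype=np.float32)
    for t in range(seq_len):
        for k, digit in enumerate(digits):
            for ax in range(2):  # reflect off the canvas borders
                if not 0 <= pos[k, ax] + vel[k, ax] <= canvas - h:
                    vel[k, ax] = -vel[k, ax]
            pos[k] += vel[k]
            r, c = pos[k].astype(int)
            # overlapping digits are merged with a pixel-wise max
            frames[t, r:r + h, c:c + h] = np.maximum(frames[t, r:r + h, c:c + h], digit)
    return frames

# usage sketch: `train_digits` would be real 28x28 MNIST crops loaded elsewhere
rng = np.random.default_rng(0)
train_digits = [rng.random((28, 28)), rng.random((28, 28))]  # stand-ins for real digits
print(make_moving_mnist_sequence(train_digits, rng=rng).shape)  # (20, 64, 64)
```

Note that the quoted splits are fixed, so a generator like this would typically be run with stored seeds or pregenerated files rather than fresh randomness at evaluation time.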
Dataset Splits | Yes | The whole dataset has a fixed number of entries: 10,000 sequences for training, 3,000 for validation, and 5,000 for testing (Moving MNIST). The something-something dataset... contains 56,769 short videos for the training set and 7,503 videos for the validation set, covering 41 action categories.
Hardware Specification | No | The paper states that experiments were conducted using TensorFlow and trained with the ADAM optimizer, but it does not provide any specific hardware details such as GPU or CPU models, memory, or cluster specifications.
Software Dependencies | No | All experiments are conducted using TensorFlow (Abadi et al., 2016) and trained with the ADAM optimizer (Kingma & Ba, 2015). No specific version numbers for TensorFlow or other software dependencies are provided.
Experiment Setup | Yes | All experiments are conducted using TensorFlow (Abadi et al., 2016) and trained with the ADAM optimizer (Kingma & Ba, 2015) to minimize the ℓ1 + ℓ2 loss over every pixel in the frame... We stack 4 E3D-LSTMs... The number of hidden-state channels of each E3D-LSTM is 64. The temporal stride is set to 1... We use the architecture illustrated in Figure 1(c) as our model, which consists of 2 layers of 3D-CNN encoders, 4 layers of E3D-LSTMs, and 2 layers of 3D-CNN decoders... We set λ(i) in Equation 5 to 10 in the beginning (i = 0) and decrease it at a rate of 2e-5 per iteration, lower bounded by η = 0.1.
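To make the quoted optimization details concrete, here is a small sketch of the per-pixel ℓ1 + ℓ2 objective and the λ(i) schedule; the function names are ours, and λ(i) is the coefficient the paper defines in its Equation 5.

```python
import numpy as np

def l1_l2_loss(pred, target):
    """Per-pixel L1 + L2 objective quoted in the setup."""
    diff = pred - target
    return np.abs(diff).mean() + (diff ** 2).mean()

def lam(i, lam0=10.0, decay=2e-5, eta=0.1):
    """lambda(i): starts at 10 for i = 0, decays by 2e-5 per iteration, floored at eta."""
    return max(lam0 - decay * i, eta)

pred = np.random.rand(4, 64, 64)
target = np.random.rand(4, 64, 64)
print(l1_l2_loss(pred, target))  # scalar loss value
print(lam(0), lam(495_000))      # 10.0 0.1 (the floor is reached at i = 495,000)
```

A linear decay with a floor keeps the λ-weighted term dominant early in training without letting it vanish entirely later on.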