Eidetic 3D LSTM: A Model for Video Prediction and Beyond
Authors: Yunbo Wang, Lu Jiang, Ming-Hsuan Yang, Li-Jia Li, Mingsheng Long, Li Fei-Fei
ICLR 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We first evaluate the E3D-LSTM network on widely used future-video-prediction datasets and achieve state-of-the-art performance. Then we show that the E3D-LSTM network also performs well on early activity recognition, inferring what is happening or what will happen after observing only a limited number of video frames. We present ablation studies to verify the effectiveness of all modules in the proposed E3D-LSTM model. |
| Researcher Affiliation | Collaboration | Yunbo Wang (Tsinghua University), Lu Jiang (Google AI), Ming-Hsuan Yang (Google AI; University of California, Merced), Li-Jia Li (Stanford University), Mingsheng Long (Tsinghua University), Li Fei-Fei (Stanford University) |
| Pseudocode | No | The paper presents equations and diagrams illustrating the model architecture and memory transitions, but it does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The source code and trained models will be made available to the public. |
| Open Datasets | Yes | The Moving MNIST dataset is constructed by randomly sampling two digits from the original MNIST dataset... The KTH action dataset (Schuldt et al., 2004) contains 25 individuals... The TaxiBJ dataset is collected from a chaotic real-world environment using the GPS monitors of taxicabs in Beijing... The something-something dataset (Goyal et al., 2017) is a recent benchmark for activity/action recognition (https://20bn.com/datasets/something-something). A minimal Moving MNIST construction sketch follows the table. |
| Dataset Splits | Yes | The whole dataset has a fixed number of entries, 10,000 sequences for training, 3,000 for validation and 5,000 for test [Moving MNIST]. The something-something dataset... contains 56,769 short videos for the training set and 7,503 videos for the validation set on 41 action categories. |
| Hardware Specification | No | The paper states that experiments were conducted using TensorFlow and trained with the ADAM optimizer, but it does not provide any specific hardware details such as GPU or CPU models, memory, or cluster specifications. |
| Software Dependencies | No | All experiments are conducted using TensorFlow (Abadi et al., 2016) and trained with the ADAM optimizer (Kingma & Ba, 2015). No specific version numbers for TensorFlow or other software dependencies are provided. |
| Experiment Setup | Yes | All experiments are conducted using TensorFlow (Abadi et al., 2016) and trained with the ADAM optimizer (Kingma & Ba, 2015) to minimize the L1 + L2 loss over every pixel in the frame... We stack 4 E3D-LSTMs... The number of hidden-state channels in each E3D-LSTM is 64. The temporal stride is set to 1... We use the architecture illustrated in Figure 1(c) as our model, which consists of 2 layers of 3D-CNN encoders, 4 layers of E3D-LSTMs, and 2 layers of 3D-CNN decoders... We set λ(i) in Equation 5 to 10 at the start (i = 0) and decrease it at a rate of 2e-5 per iteration, lower-bounded by η = 0.1. A runnable sketch of this configuration follows the table. |
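The Moving MNIST rows above follow the standard construction (Srivastava et al., 2015): two digits are sampled from MNIST, placed on a larger canvas, and moved with constant velocities that reflect off the canvas edges. The sketch below is a minimal NumPy version of that construction; the 64×64 canvas, 20-frame length, and ±3-pixel velocity range are conventional choices rather than values quoted from the paper, and `digits` is assumed to be an array of 28×28 grayscale MNIST images from any loader.

```python
import numpy as np

def make_sequence(digits, num_frames=20, canvas=64, digit_size=28, rng=None):
    """Generate one Moving-MNIST-style video with two bouncing digits."""
    rng = rng or np.random.default_rng()
    video = np.zeros((num_frames, canvas, canvas), dtype=np.float32)
    lim = canvas - digit_size                      # largest valid top-left offset
    for _ in range(2):                             # two digits per sequence
        img = digits[rng.integers(len(digits))].astype(np.float32) / 255.0
        x, y = rng.uniform(0, lim, size=2)         # initial position
        dx, dy = rng.uniform(-3, 3, size=2)        # constant velocity
        for t in range(num_frames):
            # Bounce off the canvas edges by reflecting the velocity.
            if not 0 <= x + dx <= lim:
                dx = -dx
            if not 0 <= y + dy <= lim:
                dy = -dy
            x, y = x + dx, y + dy
            xi, yi = int(round(x)), int(round(y))
            patch = video[t, yi:yi + digit_size, xi:xi + digit_size]
            np.maximum(patch, img, out=patch)      # overlay; keep brighter pixel
    return video
```

Per the splits quoted above, one would draw 10,000 such sequences for training, 3,000 for validation, and 5,000 for testing.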
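The experiment-setup row translates almost directly into code. Below is a minimal TensorFlow/Keras scaffold under stated assumptions: the E3D-LSTM cell itself is not reproduced (a plain Conv3D stands in for each recurrent layer so the scaffold runs end to end), and the 3×3×3 kernels, ReLU activations, and (10, 64, 64, 1) input shape are illustrative choices. Taken from the paper: 2 encoder and 2 decoder 3D-CNN layers, 4 recurrent layers with 64 hidden-state channels, temporal stride 1, the ADAM optimizer, the per-pixel L1 + L2 loss, and the λ(i) schedule.

```python
import tensorflow as tf

NUM_HIDDEN = 64   # hidden-state channels per recurrent layer (from the paper)
NUM_LAYERS = 4    # stacked E3D-LSTM layers (from the paper)

def l1_l2_loss(y_true, y_pred):
    """Per-pixel L1 + L2 loss, as stated in the experiment setup."""
    return (tf.reduce_mean(tf.abs(y_true - y_pred))
            + tf.reduce_mean(tf.square(y_true - y_pred)))

def lambda_schedule(i, lam0=10.0, decay=2e-5, eta=0.1):
    """λ(i) from Equation 5: 10 at i = 0, decreased by 2e-5 per iteration,
    lower-bounded by η = 0.1."""
    return max(lam0 - decay * i, eta)

def conv3d(channels, activation="relu"):
    # Temporal stride 1, spatial stride 1; 'same' padding keeps (T, H, W).
    return tf.keras.layers.Conv3D(channels, kernel_size=3, strides=1,
                                  padding="same", activation=activation)

def build_model(input_shape=(10, 64, 64, 1)):
    """2 x Conv3D encoder -> 4 recurrent layers -> 2 x Conv3D decoder,
    mirroring Figure 1(c)."""
    inputs = tf.keras.Input(shape=input_shape)       # (T, H, W, C)
    x = conv3d(NUM_HIDDEN)(inputs)                   # 3D-CNN encoder, layer 1
    x = conv3d(NUM_HIDDEN)(x)                        # 3D-CNN encoder, layer 2
    for _ in range(NUM_LAYERS):
        # Stand-in for the E3D-LSTM cell with its recall attention,
        # which this sketch deliberately does not implement.
        x = conv3d(NUM_HIDDEN)(x)
    x = conv3d(NUM_HIDDEN)(x)                        # 3D-CNN decoder, layer 1
    x = conv3d(input_shape[-1], activation=None)(x)  # 3D-CNN decoder, layer 2
    return tf.keras.Model(inputs, x)

model = build_model()
model.compile(optimizer=tf.keras.optimizers.Adam(), loss=l1_l2_loss)
```

Under this schedule, λ(i) reaches its floor of 0.1 after (10 − 0.1) / 2e-5 ≈ 495,000 iterations.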