Deep Predictive Coding Networks for Video Prediction and Unsupervised Learning

Authors: William Lotter, Gabriel Kreiman, David Cox

ICLR 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental
"Here, we explore prediction of future frames in a video sequence as an unsupervised learning rule for learning about the structure of the visual world. We show that these networks are able to robustly learn to predict the movement of synthetic (rendered) objects, and that in doing so, the networks learn internal representations that are useful for decoding latent object parameters (e.g. pose) that support object recognition with fewer training views. We also show that these networks can scale to complex natural image streams (car-mounted camera videos), capturing key aspects of both egocentric movement and the movement of objects in the visual scene, and the representation learned in this setting is useful for estimating the steering angle. Altogether, these results suggest that prediction represents a powerful framework for unsupervised learning, allowing for implicit learning of object and scene structure."
Researcher Affiliation | Academia
"William Lotter, Gabriel Kreiman & David Cox
Harvard University, Cambridge, MA 02215, USA
{lotter,davidcox}@fas.harvard.edu
gabriel.kreiman@tch.harvard.edu"
Pseudocode | Yes
"Algorithm 1: Calculation of PredNet states"
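The row only names Algorithm 1 without reproducing it. The two-pass update it refers to (a top-down sweep refreshing the representation units R_l, then a bottom-up sweep forming predictions Â_l and stacked positive/negative errors E_l) can be sketched as below. This is a minimal NumPy sketch of the control flow only: the learned ConvLSTM and convolution updates are replaced by toy placeholder arithmetic, and all shapes are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def prednet_step(frame, R, E):
    """One simplified PredNet time step (control flow of Algorithm 1 only).

    The real model updates R_l with a ConvLSTM fed by E_l and the upsampled
    R_{l+1}, and computes A_hat_l and A_{l+1} with learned convolutions;
    those learned maps are stubbed out here with toy arithmetic.
    """
    L = len(R)
    # Top-down pass: update representations from the highest layer down.
    for l in reversed(range(L)):
        top_down = R[l + 1].mean() if l + 1 < L else 0.0
        R[l] = 0.5 * R[l] + E[l].mean() + top_down  # ConvLSTM stand-in

    # Bottom-up pass: predict, compare to the target, and propagate errors.
    A = frame  # A_0 is the actual frame
    for l in range(L):
        A_hat = relu(R[l])  # stand-in for A_hat_l = ReLU(conv(R_l))
        E[l] = np.concatenate([relu(A - A_hat),   # positive error
                               relu(A_hat - A)])  # negative error
        A = E[l]  # stand-in for A_{l+1} = maxpool(conv(E_l))
    return R, E

# Toy usage: two layers on an 8x8 frame; error tensors stack the +/- halves,
# so each layer's shapes double along the first axis.
H, W, L = 8, 8, 2
R = [np.zeros((H * 2 ** l, W)) for l in range(L)]
E = [np.zeros((H * 2 ** (l + 1), W)) for l in range(L)]
R, E = prednet_step(np.random.rand(H, W), R, E)
```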
Open Source Code | Yes
"Code and video examples can be found at: https://coxlab.github.io/prednet/"
Open Datasets | Yes
"Models were trained using the raw videos from the KITTI dataset (Geiger et al., 2013) [...] We tested on the CalTech Pedestrian dataset (Dollár et al., 2009) [...] We used a dataset released by Comma.ai (Biasini et al., 2016) [...] Human3.6M (Ionescu et al., 2014) dataset"
Dataset Splits | Yes
"We used 16K sequences for training and 800 for both validation and testing. [...] 57 recording sessions used for training and 4 used for validation. [...] we used 5% of each video for validation and testing, chosen as a random continuous chunk, and discarded the 10 frames before and after the chosen segments from the training set."
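The last split in the quote (a random continuous held-out chunk per video, with a 10-frame buffer discarded from training on each side) is the easiest to mis-implement. A minimal sketch, assuming frames are held in a Python list; the function name and defaults are illustrative, not taken from the released code:

```python
import random

def hold_out_chunk(frames, frac=0.05, buffer=10, seed=0):
    """Hold out one random continuous chunk of a video (frac of its length)
    and drop `buffer` frames on each side of the chunk from training, as in
    the Human3.6M split quoted above. Names and defaults are illustrative."""
    n = len(frames)
    chunk = max(1, int(n * frac))
    start = random.Random(seed).randrange(n - chunk + 1)
    held_out = frames[start:start + chunk]
    # Training keeps everything outside the chunk plus its buffer zones.
    train = frames[:max(0, start - buffer)] + frames[start + chunk + buffer:]
    return train, held_out

# e.g. a 400-frame video yields a 20-frame held-out chunk, and at least
# 10 buffer frames on each in-range side are discarded from training.
train, held = hold_out_chunk(list(range(400)))
assert len(held) == 20 and len(train) < 400 - 20
```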
Hardware Specification | No
The paper does not specify the hardware used for training or inference of the models. It only mentions "car-mounted camera videos" for data collection.
Software Dependencies | No
The paper mentions Keras ("We would also like to thank the developers of Keras (Chollet, 2016).") but does not provide a specific version number. It does not list any other software dependencies with version numbers.
Experiment Setup | Yes
"the best performing models tended to have a loss solely concentrated at the lowest layer (i.e. λ_0 = 1, λ_{l>0} = 0) [...] the model shown has 5 layers with 3x3 filter sizes for all convolutions, max-pooling of stride 2, and number of channels per layer, for both A_l and R_l units, of (1, 32, 64, 128, 256). Model weights were optimized using the Adam algorithm (Kingma & Ba, 2014). [...] a 4 layer model with 3x3 convolutions and layer channel sizes of (3, 48, 96, 192). Models were again trained with Adam [...] Adam parameters were initially set to their default values (α = 0.001, β_1 = 0.9, β_2 = 0.999) with the learning rate, α, decreasing by a factor of 10 halfway through training."
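The quoted setup combines a layer-weighted error loss (concentrated entirely at the lowest layer) with Adam at default settings and a single 10x learning-rate drop at the halfway point; both fit in a few lines. A minimal sketch with illustrative function names and an assumed 150-epoch budget, since the row does not quote the epoch count:

```python
def weighted_error_loss(layer_errors, lambdas=(1.0, 0.0, 0.0, 0.0, 0.0)):
    """Loss = sum_l lambda_l * mean error activation of layer l; the quote's
    best models use lambda_0 = 1, lambda_{l>0} = 0 (lowest layer only)."""
    return sum(lam * err for lam, err in zip(lambdas, layer_errors))

def adam_alpha(epoch, total_epochs=150, base_alpha=1e-3):
    """Adam starts at its defaults (alpha=0.001, beta1=0.9, beta2=0.999);
    alpha drops by a factor of 10 halfway through training.
    total_epochs=150 is an assumed placeholder, not a quoted value."""
    return base_alpha if epoch < total_epochs // 2 else base_alpha / 10.0

# With lambda_0 = 1 only, the loss reduces to the lowest layer's mean error;
# the schedule returns the base rate for the first half, then a tenth of it.
assert weighted_error_loss([0.5, 0.2, 0.1, 0.0, 0.0]) == 0.5
assert adam_alpha(100) == adam_alpha(0) / 10.0
```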